stringr regex for first full *.zip filename - r

I have the following code:
test_zip_col <- "daily_44201_2015.zip259,151 Rows2,958 KBAs of 2015-11-27"
test_zip_col2 <- str_extract(test_zip_col, '^*\\.zip$')
test_zip_col
test_zip_col2
I want to extract the first occurrence of the *.zip filename. In this example, I wish to extract:
"daily_44201_2015.zip"
Could anyone please explain how to amend my str_extract code so that it does not produce an NA value?

library(stringr)
test_zip_col <- "daily_44201_2015.zip259,151 Rows2,958 KBAs of 2015-11-27"
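loc <- str_locate(test_zip_col, fixed(".zip")) # Locate the first literal ".zip" (fixed() stops "." matching any character)
str_sub(test_zip_col, start = 1, end = loc[, 2]) # Substring from the start to the end of the match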
[1] "daily_44201_2015.zip"

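We could use sub. Note that the greedy .* keeps everything up to the last ".zip" if the string contains more than one: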
sub('(.*\\.zip).*', '\\1', test_zip_col)
#[1] "daily_44201_2015.zip"
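To address the NA directly: in the original pattern the trailing `$` requires the string to end in ".zip", and `*` is a regex quantifier rather than a file-glob wildcard, so nothing matches. A minimal sketch of a `str_extract` version, assuming the filename starts at the beginning of the string (the variable name `first_zip` is only illustrative):

```r
library(stringr)

test_zip_col <- "daily_44201_2015.zip259,151 Rows2,958 KBAs of 2015-11-27"

# "^.*?\\.zip": from the start of the string, lazily take characters
# up to and including the first literal ".zip"
first_zip <- str_extract(test_zip_col, "^.*?\\.zip")
first_zip
#> [1] "daily_44201_2015.zip"
```

The lazy `.*?` stops at the first ".zip", whereas a greedy `.*` would run to the last one.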

Related

Extract a number from a string which precedes a phrase in R

I am in R and would like to extract the token 38y (two digits followed by a letter) from the following string:
"/Users/files/folder/file_number_23a_version_38y_Control.txt"
I know that _Control always comes after the 38y and that 38y is preceded by an underscore. How can I use strsplit or other R commands to extract the 38y?
You could use
x <- "/Users/files/folder/file_number_23a_version_38y_Control.txt"
regmatches(x, regexpr("[^_]+(?=_Control)", x, perl = TRUE))
# [1] "38y"
or equivalently
stringr::str_extract(x, "[^_]+(?=_Control)")
# [1] "38y"
Using gsub.
gsub('.*_(.*)_Control.*', '\\1', x)
# [1] "38y"
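Here the greedy .*_ consumes everything up to the underscore in front of the captured group, (.*) captures the text before _Control, and \\1 replaces the whole string with that capture.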
A possible solution:
library(stringr)
text <- "/Users/files/folder/file_number_23a_version_38y_Control.txt"
str_extract(text, "(?<=_)\\d+\\D(?=_Control)")
#> [1] "38y"
You can find an explanation of the regex part at:
https://regex101.com/r/PQSZHX/1
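Since the question asks about strsplit, here is a minimal sketch of that route. It assumes the file name is split on underscores and that the piece right after the wanted token starts with "Control"; the names `parts`, `idx`, and `token` are only illustrative.

```r
x <- "/Users/files/folder/file_number_23a_version_38y_Control.txt"

# Split the file name (without its directory) on literal underscores
parts <- strsplit(basename(x), "_", fixed = TRUE)[[1]]

# Find the first piece that starts with "Control" and take the piece before it
idx <- which(startsWith(parts, "Control"))[1]
token <- parts[idx - 1]
token
# [1] "38y"
```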

Add a character to a specific part of a string?

I have a list of file names as such:
"A/B/file.jpeg"
"A/C/file2.jpeg"
"B/C/file3.jpeg"
and a couple of variations of such.
My question is: how can I insert "new" (or any other characters) into each of these file names right after the second "/"? The length of the string or name shouldn't matter, only that the text is placed after the second "/".
Results would ideally be:
"A/B/newfile.jpeg"
"A/C/newfile2.jpeg" etc.
Thanks!
Another possible solution, based on stringr::str_replace:
library(stringr)
l <- c("A/B/file.jpeg", "A/B/file2.jpeg", "A/B/file3.jpeg")
str_replace(l, "\\/(?=file)", "\\/new")
#> [1] "A/B/newfile.jpeg" "A/B/newfile2.jpeg" "A/B/newfile3.jpeg"
Using gsub.
gsub('(file)', 'new\\1', x)
# [1] "A/B/newfile.jpeg" "A/C/newfile2.jpeg" "B/C/newfile3.jpeg"
Data:
x <- c("A/B/file.jpeg", "A/C/file2.jpeg", "B/C/file3.jpeg")
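Both answers above key on the literal text "file". If the file names vary, a sketch that relies only on the position of the second "/" might look like this (base R; the name `with_prefix` is only illustrative):

```r
x <- c("A/B/file.jpeg", "A/C/file2.jpeg", "B/C/file3.jpeg")

# Capture everything up to and including the second "/",
# then put it back followed by "new"
with_prefix <- sub("^([^/]*/[^/]*/)", "\\1new", x)
with_prefix
# [1] "A/B/newfile.jpeg"  "A/C/newfile2.jpeg" "B/C/newfile3.jpeg"
```

Strings with fewer than two slashes do not match and are returned unchanged.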

append letter to a string in r

I have a vector:
c("BAAAVAST", "BAACEZ", "BAAGECBA", "LOL")
And I would like to remove "BAA" from the words that contain it. And to those words I would like to append ".PR".
Desired outcome:
c("AVAST.PR", "CEZ.PR", "GECBA.PR", "LOL")
Any ideas? Ideally using stringr. Thank you a lot.
You could use the following solution:
vec <- c("BAAAVAST", "BAACEZ", "BAAGECBA", "LOL")
gsub("BAA(.*)", "\\1.PR", vec)
[1] "AVAST.PR" "CEZ.PR" "GECBA.PR" "LOL"
You could use
library(stringr)
# optimized thanks to Anoushiravan
str_replace(c("BAAAVAST", "BAACEZ", "BAAGECBA", "LOL"), "BAA(\\w*)", "\\1.PR")
#> [1] "AVAST.PR" "CEZ.PR" "GECBA.PR" "LOL"
Use \\w* if you want to match word characters only, or .* if there are no limitations on the characters.
This is more verbose than the other answers. It finds the strings containing 'BAA', removes 'BAA' from them, and appends '.PR'.
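To make the difference between \\w* and .* concrete, here is a small sketch on a made-up input that contains a non-word character (the string "BAACEZ-1" is hypothetical and not from the question):

```r
library(stringr)

# Hypothetical input with a non-word character ("-") after the prefix
s <- "BAACEZ-1"

# \\w* stops at the "-", so ".PR" is inserted before it
word_only <- str_replace(s, "BAA(\\w*)", "\\1.PR")
word_only
#> [1] "CEZ.PR-1"

# .* runs to the end of the string, so ".PR" is appended at the very end
any_char <- str_replace(s, "BAA(.*)", "\\1.PR")
any_char
#> [1] "CEZ-1.PR"
```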
inds <- grepl('BAA', vec, fixed = TRUE)
vec[inds] <- paste(sub('BAA', '', vec[inds]), 'PR', sep = '.')
vec
#[1] "AVAST.PR" "CEZ.PR" "GECBA.PR" "LOL"

Find pattern in URL with stringr and regex

I have a dataframe df with some urls. There are subcategories within the slashes in the URLs I want to extract with stringr and str_extract
My data looks like
Text    URL
Hello   www.facebook.com/group1/bla/exy/1234
Test    www.facebook.com/group2/fssas/eda/1234
Text    www.facebook.com/group-sdja/sdsds/adeds/23234
Texter  www.facebook.com/blablabla/sdksds/sdsad
I now want to extract everything between .com/ and the next /
I tried suburlpattern <- "^.com//{1,20}//$"
and df$categories <- str_extract(df$URL, suburlpattern)
But I only end up with NA in df$categories
Any idea what I am doing wrong here? Is it my regex code?
Any help is highly appreciated! Many thanks beforehand.
If you want to use str_extract, you need a regex that will get the value you need into the whole match, and you will need a (?<=[.]com/) lookbehind:
(?<=[.]com/)[^/]+
Details:
(?<=[.]com/) - the current location must be preceded by the .com/ substring
[^/]+ - matches 1 or more characters other than /.
R demo:
> URL = c("www.facebook.com/group1/bla/exy/1234", "www.facebook.com/group2/fssas/eda/1234","www.facebook.com/group-sdja/sdsds/adeds/23234", "www.facebook.com/blablabla/sdksds/sdsad")
> df <- data.frame(URL)
> library(stringr)
> res <- str_extract(df$URL, "(?<=[.]com/)[^/]+")
> res
[1] "group1" "group2" "group-sdja" "blablabla"
This will return everything between the first and second forward slashes:
library(stringr)
str_match("www.facebook.com/blablabla/sdksds/sdsad", "^[^/]+/(.+?)/")[2]
[1] "blablabla"
This works:
library(stringr)
data <- c("www.facebook.com/group1/bla/exy/1234",
"www.facebook.com/group2/fssas/eda/1234",
"www.facebook.com/group-sdja/sdsds/adeds/23234",
"www.facebook.com/blablabla/sdksds/sdsad")
suburlpattern <- "/(.*?)/"
categories <- str_extract(data, suburlpattern)
str_sub(categories, start = 2, end = -2)
Results:
[1] "group1" "group2" "group-sdja" "blablabla"
Will only get you what's between the first and second slashes... but that seems to be what you want.
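For completeness, a base R sketch of the same extraction without stringr. It assumes every URL contains ".com/" before its first slash-delimited segment; the name `categories` mirrors the column the question wants to fill.

```r
URL <- c("www.facebook.com/group1/bla/exy/1234",
         "www.facebook.com/group2/fssas/eda/1234",
         "www.facebook.com/group-sdja/sdsds/adeds/23234",
         "www.facebook.com/blablabla/sdksds/sdsad")

# Match the host up to ".com/", capture the next run of non-slash
# characters, and replace the whole string with that capture
categories <- sub("^[^/]*\\.com/([^/]+).*$", "\\1", URL)
categories
# [1] "group1"     "group2"     "group-sdja" "blablabla"
```

One difference to keep in mind: when a string does not match, sub returns it unchanged, whereas str_extract returns NA.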

Find first matching substring in a long string in R

I'm trying to find the first matching string from a vector in a long string. For example, I have example_string <- 'LionabcdBear1231DogextKittyisananimalTurtleisslow' and matching_vector <- c('Turtle', 'Dog'). I want it to return 'Dog', as this is the first substring from matching_vector that appears in the example string.
I already tried pmatch(example_string, matching_vector), but it doesn't work, presumably because it doesn't handle substrings.
Thanks!
Tim
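Does the following solution work for you?
example_string <- 'LionabcdBear1231DogextKittyisananimalTurtleisslow'
matching_vector <- c('Turtle', 'Dog')
# Position of the first occurrence of each candidate (-1 means no match)
match_ids <- sapply(matching_vector, function(x) regexpr(x, example_string, fixed = TRUE))
match_ids[match_ids < 0] <- NA # ignore candidates that do not occur at all
result <- names(match_ids)[which.min(match_ids)]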
> result
[1] "Dog"
We can use stri_match_first from stringi:
library(stringi)
stri_match_first(example_string, regex = paste(matching_vector, collapse="|"))
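The same alternation idea can be sketched in base R. This assumes the candidates contain no regex metacharacters (otherwise they would need escaping); the names `pattern` and `first_hit` are only illustrative.

```r
example_string <- 'LionabcdBear1231DogextKittyisananimalTurtleisslow'
matching_vector <- c('Turtle', 'Dog')

# Build "Turtle|Dog"; regexpr reports the leftmost match in the string,
# regardless of the order of the alternatives
pattern <- paste(matching_vector, collapse = "|")
first_hit <- regmatches(example_string, regexpr(pattern, example_string))
first_hit
# [1] "Dog"
```

If none of the candidates occurs, regmatches returns character(0).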

Resources