Extract a part of a changeable string

I have a simple but yet complicated question (at least for me)!
I would like to extract a part of a string like in this example:
From this string:
name <- "C:/Users/admin/Desktop/test/plots/"
To this:
name <- "test/plots/"
The twist is that the names change. So it's not always "test/plots/"; it could be "abc/ccc/" or "m.project/plots/" and so on.
My idea is to find the last two "/" in the string and keep only the text parts around them. But I have no idea how to do it!
Thank you for your help and time!

Without regex
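Use str_split to split your path by /. Because the path ends in /, the last piece is an empty string. So reverse the pieces, take the first three (the empty string plus the last two folder names), reverse them back, and paste them together with / using the collapse argument.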
library(stringr)
name <- "C:/Users/admin/Desktop/m.project/plots/"
paste0(rev(rev(str_split(name, "/", simplify = TRUE))[1:3]), collapse = "/")
[1] "m.project/plots/"
With regex
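Since your path could contain letters, numbers, or symbols, [^/]+/[^/]+/$ might be better. [^/]+ matches one or more characters that are not /, so the pattern captures the last two folder names with their slashes, anchored at the end of the string by $.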
library(stringr)
str_extract(name, "[^/]+/[^/]+/$")
[1] "m.project/plots/"
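If you would rather avoid the stringr dependency, the split-and-rejoin idea can be sketched in base R. The helper name last_dirs and its n argument are made up for this illustration, and the sketch assumes "/" is the only separator in your paths.

```r
# Hypothetical helper: keep the last n folder names of a "/"-separated path.
last_dirs <- function(path, n = 2) {
  parts <- strsplit(path, "/", fixed = TRUE)
  vapply(parts, function(p) {
    p <- p[nzchar(p)]  # drop empty pieces, e.g. from doubled slashes
    paste0(paste(tail(p, n), collapse = "/"), "/")
  }, character(1))
}

paths <- c("C:/Users/admin/Desktop/test/plots/",
           "C:/Users/admin/Desktop/m.project/plots/")
res_dirs <- last_dirs(paths)
res_dirs
#> [1] "test/plots/"      "m.project/plots/"
```

Because it works on the list returned by strsplit, the helper is vectorised over several paths at once.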

With {stringr}, assuming the folder names consist of lower case letters and dots only. You can adjust the characters in the square brackets as required. For example, if directory names include a mix of upper and lower case letters, use [.A-Za-z] (not [.A-z], which also matches the punctuation characters that sit between Z and a in ASCII).
Check a regex reference for options:
name <- c("C:/Users/admin/Desktop/m.project/plots/",
"C:/Users/admin/Desktop/test/plots/")
library(stringr)
str_extract(name, "[.a-z]+/[.a-z]+/$")
#> [1] "m.project/plots/" "test/plots/"
Created on 2022-03-22 by the reprex package (v2.0.1)
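A quick check of why the exact range matters when you widen the character class. This is only a sketch: the sample strings are invented, and perl = TRUE is used so that the ranges are compared by character code rather than by locale.

```r
samples <- c("abc", "a_b", "A^B")

# A-z spans every code point from "A" to "z", including [ \ ] ^ _ and `
wide_range <- grepl("^[A-z]+$", samples, perl = TRUE)

# A-Za-z covers letters only
letters_only <- grepl("^[A-Za-z]+$", samples, perl = TRUE)

wide_range
#> [1] TRUE TRUE TRUE
letters_only
#> [1]  TRUE FALSE FALSE
```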

Related

R: How to extract everything after first occurence of a dot (.)

I need to extract from a string such as outside.HLA.DR.highpass the part after the first dot, yielding HLA.DR.highpass.
Importantly, the middle part of the string, outside.xxx.highpass, might or might not have additional dots, e.g. outside.CD19.highpass should yield CD19.highpass as well.
I have a similar step where I extract the first part with sub(".[^.]+$", "", "outside.HLA.DR.highpass" ), which returns "outside.HLA.DR". However, I fail to adapt it so that it returns only the part of the string after the first dot.
Any help is greatly appreciated!!

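Your solution for extracting the first part is correct. Simply apply a similar rule, anchored at the start of the string instead of the end (the dot is escaped here so that it only matches a literal dot):
sub("^[^.]+\\.", "", "outside.HLA.DR.highpass")
This returns the desired string, "HLA.DR.highpass".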

You want to use a non-greedy regex operator here: anchor at the start of the string (^), match the fewest possible characters (.*?), then match a literal dot (\\.).
sub("^.*?\\.", "", "outside.HLA.DR.highpass")
# "HLA.DR.highpass"
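To see what the non-greedy operator buys you, compare it with the greedy version on the same input. This is just a small illustration; the variable names are arbitrary.

```r
x <- "outside.HLA.DR.highpass"

# Greedy: .* runs to the LAST dot, so too much is removed
greedy <- sub("^.*\\.", "", x)

# Non-greedy: .*? stops at the FIRST dot
lazy <- sub("^.*?\\.", "", x)

greedy
#> [1] "highpass"
lazy
#> [1] "HLA.DR.highpass"
```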

Here's a solution with stringr that will work on a vector.
library(stringr)
s <- c("outside.HLA.DR.highpass", "outside.CD19.highpass")
str_sub(
s,
start = str_locate(s, fixed("."))[, 1] + 1,
end = str_length(s)
)
#> [1] "HLA.DR.highpass" "CD19.highpass"
Created on 2022-07-13 by the reprex package (v2.0.1)
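The same locate-then-cut idea can be sketched in base R with regexpr and substring. The third element ("nodots") is an invented test case; returning NA for it is a design choice made here to mirror what str_locate does when there is no match.

```r
s <- c("outside.HLA.DR.highpass", "outside.CD19.highpass", "nodots")

# Position of the first literal dot, or -1 when there is none
first_dot <- as.integer(regexpr(".", s, fixed = TRUE))

# Keep everything after that dot; NA when the string has no dot
after_dot <- ifelse(first_dot > 0, substring(s, first_dot + 1L), NA_character_)
after_dot
#> [1] "HLA.DR.highpass" "CD19.highpass"   NA
```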

Regex for extracting string from csv before numbers

I'm very new to the regex world and would like to know how to extract strings using regex from a bunch of file names I've imported to R. My files follow the general format of:
testing1_010000.csv
check3_012000.csv
testing_checking_045880.csv
test_check2_350000.csv
And I'd like to extract everything before the 6 numbers.csv part, including the "_" to get something like:
testing1_
check3_
testing_checking_
test_check2_
If it helps, the pattern I essentially want to remove will always be 6 numbers immediately followed by .csv.
Any help would be great, thank you!

There are a few ways you could go about this. For example, match anything before a string of six digits followed by ".csv". For this one you would want to get the first capturing group.
/(.*)\d{6}.csv/
https://regex101.com/r/MPH6mE/1/
Or match everything up to the last underscore character. For this one you would want the whole match.
.*_
https://regex101.com/r/4GFPIA/1
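For reference, here is one way the two patterns above could be applied in base R. It is a sketch rather than the only way to do it: the dot before csv is escaped and the pattern is anchored with $, which the regex101 versions leave out, and [0-9] is used in place of \d.

```r
files <- c("testing1_010000.csv", "check3_012000.csv",
           "testing_checking_045880.csv", "test_check2_350000.csv")

# Pattern 1: keep the first capturing group
prefix_group <- sub("(.*)[0-9]{6}\\.csv$", "\\1", files)

# Pattern 2: keep the whole match, i.e. everything up to the last underscore
prefix_match <- regmatches(files, regexpr(".*_", files))

prefix_group
#> [1] "testing1_"         "check3_"           "testing_checking_"
#> [4] "test_check2_"
```

Both approaches give the same result on these file names. They would differ if a name had no underscore directly before the digits.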

Files = c("testing1_010000.csv", "check3_012000.csv",
"testing_checking_045880.csv", "test_check2_350000.csv")
sub("(.*_)[[:digit:]]{6}.*", "\\1", Files)
[1] "testing1_" "check3_" "testing_checking_"
[4] "test_check2_"

We can use stringr::str_match(). It also works when the number of digits is not six.
library(tidyverse)
files <- c("testing1_010000.csv", "check3_012000.csv", "testing_checking_045880.csv", "test_check2_350000.csv")
str_match(files, '(.*_)\\d+\\.csv$')[, 2]
#> [1] "testing1_" "check3_" "testing_checking_"
#> [4] "test_check2_"
The regex can be interpreted as:
"capture everything before and including an underscore, that is then followed by one or more digits .csv as an ending"
Created on 2021-12-03 by the reprex package (v2.0.1)

Using nchar:
Files = c("testing1_010000.csv", "check3_012000.csv",
"testing_checking_045880.csv", "test_check2_350000.csv")
substr(Files, 1, nchar(Files)-10)
OR
library(stringr)
str_remove(Files, "\\d{6}\\.csv$")
[1] "testing1_" "check3_" "testing_checking_"
[4] "test_check2_"

Extract character string in middle of string with R

I have character strings which look something like this:
a <- c("miRNA__hsa-mir-521-3p.iso.t5:", "miRNA__hsa-mir-947b.ref.t5:")
I want to extract the middle portion only, e.g. hsa-mir-521-3p and hsa-mir-947b.
I have tried the following so far:
a1 <- substr(a, 8,21)
[1] "hsa-mir-521-3p" "hsa-mir-947b.r"
This obviously does not work because my desired substrings have varying lengths.
a2 <- sub('miRNA__', '', a)
[1] "hsa-mir-521-3p.iso.t5:" "hsa-mir-947b.ref.t5:"
This works to remove the upstream string ("miRNA__"), but I still need to remove the downstream string.
Could someone please advise what else I could try or if there is a simpler way to achieve this? I am still learning how to code with R. Thank you very much!

You haven't clearly defined the "middle portion", but based on the data shared we can extract everything between the last underscore ("_") and the first dot (".") after it.
sub('.*_(.*?)\\..*', '\\1', a)
#[1] "hsa-mir-521-3p" "hsa-mir-947b"

You can try the following regex:
> gsub(".*_|\\..*","",a)
[1] "hsa-mir-521-3p" "hsa-mir-947b"
which removes the left-most (.*_) and right-most (\\..*) parts, therefore keeping the middle part.

We could also use trimws from base R (the whitespace argument needs R 3.6.0 or later)
trimws(a, whitespace = '.*_|\\..*')
#[1] "hsa-mir-521-3p" "hsa-mir-947b"
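The answers above all remove the parts you don't want. If you would rather extract the middle directly, a lookaround pattern is another option. This is a sketch that assumes the wanted part contains neither "_" nor "." and is always preceded by an underscore and followed by a dot.

```r
a <- c("miRNA__hsa-mir-521-3p.iso.t5:", "miRNA__hsa-mir-947b.ref.t5:")

# (?<=_)   must be preceded by an underscore
# [^_.]+   one or more characters that are neither "_" nor "."
# (?=\\.)  must be followed by a literal dot
middle <- regmatches(a, regexpr("(?<=_)[^_.]+(?=\\.)", a, perl = TRUE))
middle
#> [1] "hsa-mir-521-3p" "hsa-mir-947b"
```

Note that regmatches drops elements with no match, so the result can be shorter than the input if some strings do not follow the pattern.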

Extract shortest matching string regex

Minimal Reprex
Suppose I have the string as1das2das3D. I want to extract everything from the letter a to the letter D. There are three different substrings that match this - I want the shortest / right-most match, i.e. as3D.
One solution I know to make this work is stringr::str_extract("as1das2das3D", "a[^a]+D")
Real Example
Unfortunately, I can't get this to work on my real data. In my real data I have a string with (potentially) two URLs and I'm trying to extract the one that's immediately followed by rel=\"next\". So, in the example string below, I'd like to extract the URL https://abc.myshopify.com/ZifQ.
foo <- "<https://abc.myshopify.com/YifQ>; rel=\"previous\", <https://abc.myshopify.com/ZifQ>; rel=\"next\""
# what I've tried
stringr::str_extract(foo, '(?<=\\<)https://.*(?=\\>; rel\\="next)') # wrong output
stringr::str_extract(foo, '(?<=\\<)https://(?!https)+(?=\\>; rel\\="next)') # error

You could do:
stringr::str_extract(foo,"https:[^;]+(?=>; rel=\"next)")
[1] "https://abc.myshopify.com/ZifQ"
or even
stringr::str_extract(foo,"https(?:(?!https).)+(?=>; rel=\"next)")
[1] "https://abc.myshopify.com/ZifQ"
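The second pattern uses (?:(?!https).)+, which consumes one character at a time but refuses to step onto another "https". The same idea applied to the minimal example from the question looks like this (a sketch in base R with perl = TRUE; the variable names are arbitrary):

```r
x <- "as1das2das3D"

# Plain greedy match: starts at the first "a" and runs to the "D"
longest <- regmatches(x, regexpr("a.+D", x, perl = TRUE))

# (?:(?!a).)+ matches any characters as long as none of them is "a",
# so the match can only start at the last "a" before the "D"
shortest <- regmatches(x, regexpr("a(?:(?!a).)+D", x, perl = TRUE))

longest
#> [1] "as1das2das3D"
shortest
#> [1] "as3D"
```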

Would this be an option?
Split the string on ; or , then find the element equal to the target string and take the URL from the index just before it. Note that the result still includes the angle brackets.
urls <- strsplit(foo, ";\\s+|,\\s+")[[1]]
urls[which(urls == "rel=\"next\"") - 1]
#[1] "<https://abc.myshopify.com/ZifQ>"
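If you need the bare URL, the angle brackets can be stripped in one more step. A sketch of the full split approach, with a guard for the case where there is no rel="next" entry (returning NA there is an assumption about what you would want):

```r
foo <- "<https://abc.myshopify.com/YifQ>; rel=\"previous\", <https://abc.myshopify.com/ZifQ>; rel=\"next\""

parts <- strsplit(foo, ";\\s+|,\\s+")[[1]]
idx <- which(parts == "rel=\"next\"")

# Take the element just before rel="next" and drop the surrounding < >
next_url <- if (length(idx) == 1 && idx > 1) {
  gsub("^<|>$", "", parts[idx - 1])
} else {
  NA_character_
}
next_url
#> [1] "https://abc.myshopify.com/ZifQ"
```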

Here is another option.
gsub(".+\\, <(.+)>; rel=\"next\"", "\\1", foo, perl = TRUE)
#[1] "https://abc.myshopify.com/ZifQ"

regex - define boundary using characters & delimiters

I realize this is a rather simple question and I have searched throughout this site, but just can't seem to get my syntax right for the following regex challenges. I'm looking to do two things. First have the regex to pick up the first three characters and stop at a semicolon. For example, my string might look as follows:
Apt;House;Condo;Apts;
I'd like to get to this:
Apartment;House;Condo;Apartment
I'd also like to create a regex to substitute a word in between delimiters, while keep others unchanged. For example, I'd like to go from this:
feline;labrador;bird;labrador retriever;labrador dog; lab dog;
To this:
feline;dog;bird;dog;dog;dog;
Below is the regex I'm working with. I know ^ denotes the beginning of the string and $ the end. I've tried many variations, and am making substitutions, but am not achieving my desired output. I'm also guessing one regex could work for both? Thanks for your help everyone.
df$variable <- gsub("^apt$;", "Apartment;", df$variable, ignore.case = TRUE)

Here is an approach that uses look behind (so you need perl=TRUE):
> tmp <- c("feline;labrador;bird;labrador retriever;labrador dog; lab dog;",
+ "lab;feline;labrador;bird;labrador retriever;labrador dog; lab dog")
> gsub( "(?<=;|^) *lab[^;]*", "dog", tmp, perl=TRUE)
[1] "feline;dog;bird;dog;dog;dog;"
[2] "dog;feline;dog;bird;dog;dog;dog"
The (?<=;|^) is the look behind. It says that any match must be preceded by either a semi-colon or the beginning of the string, but what the look behind matches is not included in the part to be replaced. The " *" will match 0 or more spaces (since your example string had one case where there was a space between the semi-colon and the lab). It then matches a literal lab followed by 0 or more characters other than a semi-colon. Since * is greedy by default, this will match everything up to, but not including, the next semi-colon or the end of the string. You could also include a positive look ahead (?=;|$) to make sure it goes all the way to the next semi-colon or end of string, but in this case the greediness of * will take care of that.
You could also use the non-greedy modifier, then force to match to end of string or semi-colon:
> gsub( "(?<=;|^) *lab.*?(?=;|$)", "dog", tmp, perl=TRUE)
[1] "feline;dog;bird;dog;dog;dog;"
[2] "dog;feline;dog;bird;dog;dog;dog"
The .*? will match 0 or more characters, but as few as it can get away with, stretching just until the next semi-colon or end of line.
You can skip the look behind (and perl=TRUE) if you match the delimiter, then include it in the replacement:
> gsub("(;|^) *lab[^;]*", "\\1dog", tmp)
[1] "feline;dog;bird;dog;dog;dog;"
[2] "dog;feline;dog;bird;dog;dog;dog"
With this method you need to be careful to match the delimiter on one side only (the first in my example), because the match consumes the delimiter (unlike the look-ahead or look-behind versions). If you consume both delimiters, the next field is skipped and only every other field is considered for replacement.
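A small demonstration of that pitfall, using an invented string in which three consecutive fields should all be replaced:

```r
x <- "a;lab1;lab2;lab3;"

# Consuming the delimiter on BOTH sides: after ";lab1;" is replaced, the
# search resumes at "lab2", which no longer has a leading ";" to match
both_consumed <- gsub(";lab[^;]*;", ";dog;", x)

# Consuming only the leading delimiter and merely looking ahead for the
# trailing one leaves it available for the next match
lookahead <- gsub(";lab[^;]*(?=;)", ";dog", x, perl = TRUE)

both_consumed
#> [1] "a;dog;lab2;dog;"
lookahead
#> [1] "a;dog;dog;dog;"
```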

I'd recommend doing this in two steps (plus an optional third):
1. Split the string by the delimiters
2. Do the replacements
3. (optional, if that's what you gotta do) Smash the strings back together.
To split the string, I'd use the stringr library. But you can use base R too:
myString <- "Apt;House;Condo;Apts;"
# base R (drops the empty field after the final ";")
splitString <- unlist(strsplit(myString, ";", fixed = TRUE))
# with stringr (keeps the empty field after the final ";")
library(stringr)
splitString <- as.vector(str_split(myString, ";", simplify = TRUE))
Once you've done that, THEN you can do the text substitution:
# base R
fixedApts <- gsub("^Apt$|^Apts$", "Apartment", splitString)
# with stringr
fixedApts <- str_replace(splitString, "^Apt$|^Apts$", "Apartment")
# then do the rest of your replacements
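There's probably a better way to do the replacements than regular expressions (using switch(), maybe?)
Use paste0(fixedApts, collapse = ";") to collapse the vector back into a single semicolon-delimited string at the end if that's what you need to do. (With collapse = "" the fields would be glued together without any delimiter. If you split with strsplit, add the final ";" back yourself, since strsplit drops the trailing empty field.)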
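Putting the steps together, one regex-free way to do the replacement is a named lookup vector. This is only a sketch: the lookup table and variable names are made up, and it assumes each field must match a key exactly.

```r
my_string <- "Apt;House;Condo;Apts;"

# 1. Split (strsplit drops the empty field after the final ";")
fields <- strsplit(my_string, ";", fixed = TRUE)[[1]]

# 2. Replace via a hypothetical lookup table: names are the old values
lookup <- c(Apt = "Apartment", Apts = "Apartment")
hit <- fields %in% names(lookup)
fields[hit] <- unname(lookup[fields[hit]])

# 3. Rejoin, restoring the trailing ";"
result <- paste0(paste(fields, collapse = ";"), ";")
result
#> [1] "Apartment;House;Condo;Apartment;"
```

Adding another substitution is then a matter of adding an entry to lookup rather than extending a regular expression.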