Extract character string in middle of string with R

I have character strings which look something like this:
a <- c("miRNA__hsa-mir-521-3p.iso.t5:", "miRNA__hsa-mir-947b.ref.t5:")
I want to extract the middle portion only, e.g. hsa-mir-521-3p and hsa-mir-947b.
I have tried the following so far:
a1 <- substr(a, 8,21)
[1] "hsa-mir-521-3p" "hsa-mir-947b.r"
This obviously does not work, because my desired substrings have varying lengths.
a2 <- sub('miRNA__', '', a)
[1] "hsa-mir-521-3p.iso.t5:" "hsa-mir-947b.ref.t5:"
This works to remove the upstream string (“miRNA__”), but I still need to remove the downstream string.
Could someone please advise what else I could try or if there is a simpler way to achieve this? I am still learning how to code with R. Thank you very much!

You haven't clearly defined the "middle portion", but based on the data shared we can extract everything between the last underscore ("_") and the first dot (".") that follows it.
sub('.*_(.*?)\\..*', '\\1', a)
#[1] "hsa-mir-521-3p" "hsa-mir-947b"

You can try gsub with the following regex:
> gsub(".*_|\\..*","",a)
[1] "hsa-mir-521-3p" "hsa-mir-947b"
which removes the left-most (.*_) and right-most (\\..*) parts, therefore keeping the middle part.
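To see why the alternation works, the two branches can be applied one at a time. This is a minimal base R sketch; the intermediate names step1, middle and combined are only for illustration.

```r
a <- c("miRNA__hsa-mir-521-3p.iso.t5:", "miRNA__hsa-mir-947b.ref.t5:")

# Branch 1: ".*_" is greedy, so it removes everything up to the last underscore
step1 <- sub(".*_", "", a)
step1

# Branch 2: "\\..*" removes everything from the first literal dot onwards
middle <- sub("\\..*", "", step1)
middle

# gsub() with both branches joined by "|" does the two removals in one call
combined <- gsub(".*_|\\..*", "", a)
identical(middle, combined)
```

Applying the branches separately can be easier to debug when the real data has prefixes or suffixes that differ from the sample.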

We could also use trimws from base R (the whitespace argument requires R 3.6.0 or later)
trimws(a, whitespace = '.*_|\\..*')
#[1] "hsa-mir-521-3p" "hsa-mir-947b"

Related

Extract a part of a changeable string

I have a simple but yet complicated question (at least for me)!
I would like to extract a part of a string like in this example:
From this string:
name <- "C:/Users/admin/Desktop/test/plots/"
To this:
name <- "test/plots/"
The plot twist for my problem is that the names are changing. So it's not always "test/plots/"; it could be "abc/ccc/" or "m.project/plots/" and so on.
In my imagination I would use something to find the last two "/" in the string and cut out the text parts around them. But I have no idea how to do it!
Thank you for your help and time!
Without regex
Use str_split to split your path by /. Then reverse the resulting pieces, take the first three (the empty piece after the trailing / plus the last two folder names), reverse them back, and paste them together with / using the collapse argument.
library(stringr)
name <- "C:/Users/admin/Desktop/m.project/plots/"
paste0(rev(rev(str_split(name, "/", simplify = TRUE))[1:3]), collapse = "/")
[1] "m.project/plots/"
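The same idea can be sketched in base R without stringr. This version assumes the path always ends in a trailing "/", as in the examples; the names parts and last_two are only for illustration.

```r
name <- "C:/Users/admin/Desktop/m.project/plots/"

# strsplit() drops the empty piece after the trailing "/"
parts <- strsplit(name, "/", fixed = TRUE)[[1]]
parts

# Keep the last two folder names and restore the separators
last_two <- paste0(paste(tail(parts, 2), collapse = "/"), "/")
last_two
```

Because strsplit() discards the empty trailing piece, only two elements are needed here instead of three.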
With regex
Since your path could contain letters, numbers, or symbols, [^/]+/[^/]+/$ might be better: [^/]+ matches any run of characters other than /, so the pattern keeps the last two folders before the end of the string.
library(stringr)
str_extract(name, "[^/]+/[^/]+/$")
[1] "m.project/plots/"
With {stringr}, assuming the path comprises folders whose names contain only lower case letters and dots. You could adjust the alternatives in the square brackets as required; for example, if directory names include a mix of upper and lower case letters, use [.A-Za-z].
Check a regex reference for options:
name <- c("C:/Users/admin/Desktop/m.project/plots/",
"C:/Users/admin/Desktop/test/plots/")
library(stringr)
str_extract(name, "[.a-z]+/[.a-z]+/$")
#> [1] "m.project/plots/" "test/plots/"
Created on 2022-03-22 by the reprex package (v2.0.1)
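A caveat when widening the bracket expression to cover upper case: the range A-z is not the same as A-Za-z, because a few punctuation characters sit between "Z" and "a" in code-point order. A small base R check, using made-up folder names:

```r
# With perl = TRUE, ranges follow code-point order, where "_" sits between "Z" and "a"
loose <- grepl("^[A-z]+$", c("Plots", "my_dir"), perl = TRUE)
loose

# Spelling out both ranges excludes the underscore
strict <- grepl("^[A-Za-z]+$", c("Plots", "my_dir"), perl = TRUE)
strict
```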

String Manipulation in R data frames

I have just learnt R and am trying to clean data for analysis with string manipulation. I used the code below on the Amount_USD column of a table, but I cannot work out why no changes were made. Please help.
Code:
csv_file2$Amount_USD <- ifelse(str_sub(csv_file$Amount_USD,1,10) == "\\\xc2\\\xa0",
str_sub(csv_file$Amount_USD,12,-1),csv_file2$Amount_USD)
Result:
\\xc2\\xa010,000,000
\\xc2\\xa016,200,000
\\xc2\\xa019,350,000
Expected Result:
10,000,000
16,200,000
19,350,000
You could use the following code, but maybe there is a more compact way:
vec <- c("\\xc2\\xa010,000,000", "\\xc2\\xa016,200,000", "\\xc2\\xa019,350,000")
gsub("(\\\\x[[:alpha:]]\\d\\\\x[[:alpha:]]0)([[:digit:],]*)", "\\2", vec)
[1] "10,000,000" "16,200,000" "19,350,000"
A compact way to extract the numbers is by using str_extract and negative lookahead:
library(stringr)
str_extract(vec, "(?!0)[\\d,]+$")
[1] "10,000,000" "16,200,000" "19,350,000"
How this works:
(?!0): this is a negative lookahead that makes sure the next character is not 0, so the match cannot start at the final 0 of \\xa0
[\\d,]+$: a character class allowing only digits and commas to occur one or more times right up to the string end $
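The effect of the lookahead is easiest to see by running the pattern with and without it. The sketch below uses base R with perl = TRUE instead of stringr so that it has no dependencies; first_match is a hypothetical helper name.

```r
vec <- c("\\xc2\\xa010,000,000", "\\xc2\\xa016,200,000", "\\xc2\\xa019,350,000")

# Hypothetical helper: first match of a PCRE pattern in each string
first_match <- function(pattern, x) regmatches(x, regexpr(pattern, x, perl = TRUE))

# Without the lookahead the match starts at the last 0 of the prefix
without_la <- first_match("[\\d,]+$", vec)
without_la

# With the lookahead the match is not allowed to start at a 0
with_la <- first_match("(?!0)[\\d,]+$", vec)
with_la
```

Note that this relies on the amounts themselves never starting with 0.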
Alternatively:
str_sub(vec, start = 9)
There were a few minor issues with your code.
The main one being two unneeded backslashes in your matching statement. This also leads to a counting error in your first str_sub(), where you should be getting the first 8 characters not 10. Finally, you should be getting the substring from the next character after the text you want to match (i.e. position 9, not 12). The following should work:
csv_file2$Amount_USD <- ifelse(str_sub(csv_file$Amount_USD,1,8) == "\\xc2\\xa0", str_sub(csv_file$Amount_USD,9,-1),csv_file2$Amount_USD)
However, I would have done this with a more compact gsub than provided above. As long as the text at the start to remove is always going to be "\\xc2\\xa0", you can simply replace it with nothing. Note that for gsub you will need to escape all the backslashes, and hence you end up with:
csv_file2$Amount_USD <- gsub("\\\\xc2\\\\xa0", replacement = "", csv_file2$Amount_USD)
Personally, especially if you plan to do any sort of mathematics with this column, I would go the additional step and remove the commas, and then coerce the column to be numeric:
csv_file2$Amount_USD <- as.numeric(gsub("(\\\\xc2\\\\xa0)|,", replacement = "", csv_file2$Amount_USD))
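As a quick check of that last step, here is the same replacement on a plain vector standing in for the Amount_USD column (vec and amount are illustrative names, not part of the original data frame):

```r
vec <- c("\\xc2\\xa010,000,000", "\\xc2\\xa016,200,000", "\\xc2\\xa019,350,000")

# Remove the literal prefix and the thousands separators, then convert
amount <- as.numeric(gsub("(\\\\xc2\\\\xa0)|,", "", vec))
amount

# The result is numeric, so arithmetic now works
total <- sum(amount)
total
```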

How to match any character existing between a pattern and a semicolon

I am trying to get anything existing between sample_id= and ; in a vector like this:
sample_id=10221108;gender=male
tissue_id=23;sample_id=321108;gender=male
treatment=no;tissue_id=98;sample_id=22
My desired output would be:
10221108
321108
22
How can I get this?
I've been trying several things like this, but I don't find the way to do it correctly:
clinical_data$sample_id<-c(sapply(myvector, function(x) sub("subject_id=.;", "\\1", x)))
You could use sub with a capture group to isolate that which you are trying to match:
out <- sub("^.*\\bsample_id=(\\d+).*$", "\\1", x)
out
[1] "10221108" "321108" "22"
Data:
x <- c("sample_id=10221108;gender=male",
"tissue_id=23;sample_id=321108;gender=male",
"treatment=no;tissue_id=98;sample_id=22")
Note that the actual output above is character, not numeric. But, you may easily convert using as.numeric if you need to do that.
Edit:
If you are unsure that the sample IDs would always be just digits, here is another version you may use to capture any content following sample_id:
out <- sub("^.*\\bsample_id=([^;]+).*$", "\\1", x)
out
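One thing to keep in mind with this approach is that sub() returns its input unchanged when the pattern does not match. A minimal sketch, using hypothetical inputs (one non-numeric ID and one string with no sample_id at all):

```r
# Hypothetical inputs, not from the original data
x2 <- c("tissue_id=23;sample_id=AB-22;gender=male", "gender=female")

out2 <- sub("^.*\\bsample_id=([^;]+).*$", "\\1", x2)
out2   # the second element comes back untouched

# Mask the entries that had no sample_id field
out2[!grepl("\\bsample_id=", x2)] <- NA
out2
```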
You could try the str_extract function from the stringr package.
If each element of your vector x holds one record, you can do:
library(stringr)
str_extract(x, "(?<=\\bsample_id=)\\d+") # the lookbehind targets digits that are preceded by sample_id=, and the + captures the whole run of digits
This extracts one ID per element. If all of your data sits in a single string, it becomes a tad more difficult, because you have to tell the extraction to continue after it has found the first match. The code would look something like this:
str_extract_all(x, "(?<=sample_id=)\\d+")
This extracts all of the IDs you're looking for, and the output is a list with one element per input string. From there you can manipulate the list as you see fit.
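If stringr is not available, the all-matches case can be sketched in base R with gregexpr() and regmatches(). The single string txt below is a made-up example that joins the three records with spaces.

```r
# Hypothetical input: all records collected in one string
txt <- paste("sample_id=10221108;gender=male",
             "tissue_id=23;sample_id=321108;gender=male",
             "treatment=no;tissue_id=98;sample_id=22")

# gregexpr() finds every match; the lookbehind needs perl = TRUE
ids <- regmatches(txt, gregexpr("(?<=sample_id=)\\d+", txt, perl = TRUE))
ids        # a list with one element per input string
ids[[1]]
```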

Extract shortest matching string regex

Minimal Reprex
Suppose I have the string as1das2das3D. I want to extract everything from the letter a to the letter D. There are three different substrings that match this - I want the shortest / right-most match, i.e. as3D.
One solution I know to make this work is stringr::str_extract("as1das2das3D", "a[^a]+D")
Real Example
Unfortunately, I can't get this to work on my real data. In my real data I have a string with (potentially) two URLs and I'm trying to extract the one that's immediately followed by rel=\"next\". So, in the example string below, I'd like to extract the URL https://abc.myshopify.com/ZifQ.
foo <- "<https://abc.myshopify.com/YifQ>; rel=\"previous\", <https://abc.myshopify.com/ZifQ>; rel=\"next\""
# what I've tried
stringr::str_extract(foo, '(?<=\\<)https://.*(?=\\>; rel\\="next)') # wrong output
stringr::str_extract(foo, '(?<=\\<)https://(?!https)+(?=\\>; rel\\="next)') # error
You could do:
stringr::str_extract(foo,"https:[^;]+(?=>; rel=\"next)")
[1] "https://abc.myshopify.com/ZifQ"
or even
stringr::str_extract(foo,"https(?:(?!https).)+(?=>; rel=\"next)")
[1] "https://abc.myshopify.com/ZifQ"
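The second pattern works because (?:(?!https).)+ consumes one character at a time and refuses to step onto a second "https", so a match that starts at the first URL can never reach rel="next". A minimal base R sketch comparing it with a plain greedy .* (first_match is a hypothetical helper name):

```r
foo <- "<https://abc.myshopify.com/YifQ>; rel=\"previous\", <https://abc.myshopify.com/ZifQ>; rel=\"next\""

# Hypothetical helper: first match of a PCRE pattern
first_match <- function(pattern, x) regmatches(x, regexpr(pattern, x, perl = TRUE))

# Greedy ".*" starts at the first URL and runs through to the second one
greedy <- first_match("https://.*(?=>; rel=\"next)", foo)
greedy

# The negative lookahead stops the match from crossing into another "https"
tempered <- first_match("https(?:(?!https).)+(?=>; rel=\"next)", foo)
tempered
```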
Would this be an option?
Split the string on ; or , then find the element equal to the target string rel=\"next\" and take the URL from the index just before it.
urls <- strsplit(foo, ";\\s+|,\\s+")[[1]]
urls[which(urls == "rel=\"next\"") - 1]
#[1] "<https://abc.myshopify.com/ZifQ>"
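The result above still carries the surrounding angle brackets. If the bare URL is needed, one more substitution can strip them; a small sketch, with hit and bare as illustrative names:

```r
foo <- "<https://abc.myshopify.com/YifQ>; rel=\"previous\", <https://abc.myshopify.com/ZifQ>; rel=\"next\""

urls <- strsplit(foo, ";\\s+|,\\s+")[[1]]
hit <- urls[which(urls == "rel=\"next\"") - 1]

# Drop a leading "<" and a trailing ">"
bare <- gsub("^<|>$", "", hit)
bare
```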
Here is another option, which captures the URL that directly precedes rel=\"next\":
gsub(".+\\, <(.+)>; rel=\"next\"", "\\1", foo, perl = TRUE)
#[1] "https://abc.myshopify.com/ZifQ"

regex - define boundary using characters & delimiters

I realize this is a rather simple question and I have searched throughout this site, but just can't seem to get my syntax right for the following regex challenges. I'm looking to do two things. First have the regex to pick up the first three characters and stop at a semicolon. For example, my string might look as follows:
Apt;House;Condo;Apts;
I'd like to get this:
Apartment;House;Condo;Apartment
I'd also like to create a regex to substitute a word in between delimiters, while keep others unchanged. For example, I'd like to go from this:
feline;labrador;bird;labrador retriever;labrador dog; lab dog;
To this:
feline;dog;bird;dog;dog;dog;
Below is the regex I'm working with. I know ^ denotes the beginning of the string and $ the end. I've tried many variations, and am making substitutions, but am not achieving my desired output. I'm also guessing one regex could work for both? Thanks for your help everyone.
df$variable <- gsub("^apt$;", "Apartment;", df$variable, ignore.case = TRUE)
Here is an approach that uses look behind (so you need perl=TRUE):
> tmp <- c("feline;labrador;bird;labrador retriever;labrador dog; lab dog;",
+ "lab;feline;labrador;bird;labrador retriever;labrador dog; lab dog")
> gsub( "(?<=;|^) *lab[^;]*", "dog", tmp, perl=TRUE)
[1] "feline;dog;bird;dog;dog;dog;"
[2] "dog;feline;dog;bird;dog;dog;dog"
The (?<=;|^) is the look behind. It says that any match must be preceded by either a semi-colon or the beginning of the string, but what the look behind matches is not included in the part to be replaced. The space followed by * will match 0 or more spaces (since your example string had one case where there was a space between the semi-colon and the lab). It then matches a literal lab followed by 0 or more characters other than a semi-colon. Since * is greedy by default, this will match everything up to, but not including, the next semi-colon or the end of the string. You could also include a positive look ahead (?=;|$) to make sure it goes all the way to the next semi-colon or end of string, but in this case the greediness of * will take care of that.
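To illustrate what the look behind contributes, the sketch below runs the pattern with and without it on a made-up string in which "lab" also appears in the middle of a field (x, with_lb and without_lb are illustrative names):

```r
# Hypothetical input: "collab" contains "lab" but does not start with it
x <- "collab;labrador; lab dog"

# With the look behind, only fields that start with "lab" are replaced
with_lb <- gsub("(?<=;|^) *lab[^;]*", "dog", x, perl = TRUE)
with_lb

# Without it, the "lab" inside "collab" is replaced as well
without_lb <- gsub(" *lab[^;]*", "dog", x)
without_lb
```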
You could also use the non-greedy modifier, then force to match to end of string or semi-colon:
> gsub( "(?<=;|^) *lab.*?(?=;|$)", "dog", tmp, perl=TRUE)
[1] "feline;dog;bird;dog;dog;dog;"
[2] "dog;feline;dog;bird;dog;dog;dog"
The .*? will match 0 or more characters, but as few as it can get away with, stretching just until the next semi-colon or end of line.
You can skip the look behind (and perl=TRUE) if you match the delimiter, then include it in the replacement:
> gsub("(;|^) *lab[^;]*", "\\1dog", tmp)
[1] "feline;dog;bird;dog;dog;dog;"
[2] "dog;feline;dog;bird;dog;dog;dog"
With this method you need to be careful to match the delimiter on one side only (the first, in my example), because the match consumes the delimiter, which does not happen with a look-ahead or look-behind. If you consume both delimiters, the delimiter that closes one field is no longer available to open the next one, so only every other field will be considered for replacement.
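The skipped-field effect can be reproduced on a small made-up string. In this sketch, both consumes a delimiter on each side of the field, while one_side consumes only the leading one; perl = TRUE is used here only to keep the two calls on the same engine.

```r
# Hypothetical input with three adjacent fields that should all become "dog"
x <- "labrador;lab dog;labrador"

# Consuming the delimiter on both sides: the middle field loses its leading ";"
both <- gsub("(;|^) *lab[^;]*(;|$)", "\\1dog\\2", x, perl = TRUE)
both

# Consuming only the leading delimiter: every field is considered
one_side <- gsub("(;|^) *lab[^;]*", "\\1dog", x, perl = TRUE)
one_side
```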
I'd recommend doing this in two steps, plus an optional third:
1. Split the string by the delimiters.
2. Do the replacements.
3. (Optional, if that's what you gotta do) Smash the strings back together.
To split the string, I'd use the stringr library. But you can use base R too:
myString <- "Apt;House;Condo;Apts;"
# base R
splitString <- unlist(strsplit(myString, ";", fixed = TRUE))
# with stringr
library(stringr)
splitString <- as.vector(str_split(myString, ";", simplify = TRUE))
Once you've done that, THEN you can do the text substitution:
# base R
fixedApts <- gsub("^Apt$|^Apts$", "Apartment", splitString)
# with stringr
fixedApts <- str_replace(splitString, "^Apt$|^Apts$", "Apartment")
# then do the rest of your replacements
There's probably a better way to do the replacements than regular expressions (using switch(), maybe?)
Use paste0(fixedApts, collapse = ";") to collapse the vector back into a single semicolon-delimited string at the end if that's what you need to do.
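As one possible regex-free way to do the replacement step, a named lookup vector can map each abbreviation to its full form. This is only a sketch: the lookup table and the names parts, hit and result are assumptions made for illustration.

```r
myString <- "Apt;House;Condo;Apts;"

# Step 1: split (strsplit() drops the empty piece after the trailing ";")
parts <- strsplit(myString, ";", fixed = TRUE)[[1]]

# Step 2: replace exact matches via a hypothetical lookup table
lookup <- c(Apt = "Apartment", Apts = "Apartment")
hit <- parts %in% names(lookup)
parts[hit] <- unname(lookup[parts[hit]])

# Step 3: join the pieces back together with the delimiter
result <- paste(parts, collapse = ";")
result
```

Adding another abbreviation then only means adding an entry to lookup, and because the match is on whole fields, "Apts" can never be half-replaced.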
