How to keep only information inside a complex string in R? - r

I want to keep a string of character inside a complex string. I think that I can use regex to do keep the thing that I need. Basically, I want to keep only the information between the \" and \" in Function=\"SMAD5\". I also want to keep the empty strings: Function=\"\"
df=structure(1:6, .Label = c("ID=Gfo_R000001;Source=ENST00000513418;Function=\"SMAD5\";",
"ID=Gfo_R000002;Source=ENSTGUT00000017468;Function=\"CENPA\";",
"ID=Gfo_R000003;Source=ENSGALT00000028134;Function=\"C1QL4\";",
"ID=Gfo_R000004;Source=ENSTGUT00000015300;Function=\"\";", "ID=Gfo_R000005;Source=ENSTGUT00000019268;Function=\"\";",
"ID=Gfo_R000006;Source=ENSTGUT00000019035;Function=\"\";"), class = "factor")
This should look like this:
"SMAD5"
"CENPA"
"C1QL4"
NA
NA
NA
So far that What I was able to do:
gsub('.*Function=\"',"",df)
[1] "SMAD5\";" "CENPA\";" "C1QL4\";" "\";" "\";" "\";"
But I'm stuck with a bunch of \";". How can I remove them with one line?
I tried this:
gsub('.*Function=\"' & '.\"*',"",test)
But it's giving me this error:
Error in ".*Function=\"" & ".\"*" :
operations are possible only for numeric, logical or complex types

You may use
gsub(".*Function=\"([^\"]*).*","\\1",df)
See the regex demo
Details:
.* - any 0+ chars as many as possible up to the last...
Function=\" - a Function=" substring
([^\"]*) - capturing group 1 matching 0+ chars other than a "
.* - and the rest of the string.
The \1 is the backreference restoring the contents of the Group 1 in the result.

With stringr we can capture groups too:
library(stringr)
matches <- str_match(df, ".*\"(.*)\".*")[,2]
ifelse(matches=='', NA, matches)
# [1] "SMAD5" "CENPA" "C1QL4" NA NA NA

The regular expression can be constructed more readably using rebus.
rx <- 'Function="' %R%
capture(zero_or_more(negated_char_class('"')))
Then matching is as mentioned by Wiktor and sandipan.
rx <- 'Function="' %R% capture(zero_or_more(negated_char_class('"')))
str_match(df, rx)
stri_match_first_regex(df, rx)
gsub(any_char(0, Inf) %R% rx %R% any_char(0, Inf), REF1, df)

Related

How to replace `qux$foo$bar` with `qux[["foo"]][["bar]]`? (dollar subsetting to brackets subsetting)

In a R script (that I would read with readLines), I want to replace every occurence of qux$foo$bar with qux[["foo"]][["bar]]. But I'm not a regex master.
I started with this regex:
> gsub("(\\w*)(\\$)(\\w*)", '\\1[["\\3"]]', "qux$foo$bar; input$test$a$a") %>% cat
qux[["foo"]][["bar"]]; input[["test"]][["a"]][["a"]]
Nice. But I also want to handle the case of backticks. So I tried:
> gsub("(\\w*)(\\$)`{0,1}(\\w*)`{0,1}", '\\1[["\\3"]]', "qux$`foo`; bar$`baz`; x$uvw") %>% cat
qux[["foo"]]; bar[["baz"]]; x[["uvw"]]
Looks correct. But between the backticks, there could be a space, and the previous way does not work in this case. So I tried the following, which neither does not work:
gsub("(\\w*)(\\$)`{0,1}(.*)`{0,1}", '\\1[["\\3"]]', "qux$`fo o`") %>% cat
qux[["fo o`"]]
Could you help to find the right regex pattern? It seems that instead of \\w I need something which means match a "word that can contain spaces".
You can use
gsub('(\\w*)(?|\\$`([^`]*)`|\\$([^\\s$]+))', '\\1[["\\2"]]', x, perl=TRUE)
## Or
gsub('\\$`([^`]*)`|\\$([^\\s$]+)', '[["\\1\\2"]]', x, perl=TRUE)
See the regex #1 demo and regex #2 demo. Details:
(\w*) - Group 1 (\1): zero or more word chars
(?|$`([^`]*)`|$([^\s$]+)) - a branch reset group matching either
$`([^`]*)` - $, backtick, Group 2 (\2) capturing zero or more non-backtick chars, and a backtick.
| - or
$([^\s$]+) - $, then Group 2 (\2) capturing one or more chars other than whitespace and $
See the R demo:
x <- c('qux$foo$bar','qux$foo$bar; input$test$a$a','qux$`foo`; bar$`baz`; x$uvw','qux$`fo o`', 'q_ux$f_o_o$b.a_r')
gsub('(\\w*)(?|\\$`([^`]*)`|\\$([^\\s$]+))', '\\1[["\\2"]]', x, perl=TRUE)
## Or
## gsub('\\$`([^`]*)`|\\$([^\\s$]+)', '[["\\1\\2"]]', x, perl=TRUE)
Output:
[1] "qux[[\"foo\"]][[\"bar\"]]"
[2] "qux[[\"foo\"]][[\"bar;\"]] input[[\"test\"]][[\"a\"]][[\"a\"]]"
[3] "qux[[\"foo\"]]; bar[[\"baz\"]]; x[[\"uvw\"]]"
[4] "qux[[\"fo o\"]]"
[5] "q_ux[[\"f_o_o\"]][[\"b.a_r\"]]"
Note: backslashes in the output are console artifacts to keep the double quoted strings valid string literals, they are not part of the plain text output.
You might repeat optional spaces before and after matching 1 or more word characters.
You don't need a capture group for the $ but instead you could use a capture group to pair up the backtick in case it is there or not using a backreference to group 2.
To repeat 0+ whitespace chars you can also use \s but that could also match a newline.
Note that \w* matches optional word chars, and {0,1} can be written as ?
(\w*)\$(`?)( *\w+(?: +\w+)* *)\2
The pattern matches:
(\w*) Capture group 1 Match optional word characters
\$ Match $
(`?) Capture group 2, optionally match a backtick
( *\w+(?: +\w+)* *) Capture group 3 Match repetitions of word characters between spaces
\2 Backreference to what is captured in group 2 (yes or no backtick)
Regex demo
gsub("(\\w*)\\$(`?)( *\\w+(?: +\\w+)* *)\\2", '\\1[["\\3"]]', "qux$fo o$bar", perl=TRUE)
Output
[1] "qux[[\"fo o\"]][[\"bar\"]]"

Retain string up to second slash in regex?

I am trying to only retain the string after the first section of characters (which includes - and numerics) but before the forward slash.
I have the following string:
x <- c('/youtube.com/videos/cats', '/google.com/images/dogs', 'bbc.com/movies')
/youtube.com/videos/cats
/google.com/images/dogs
bbc.com/movies
So it would look like this
/youtube.com/
/google.com/
bbc.com/
For reference I am using R 3.6
I have tried positive lookbehinds and the closest I got was this: ^\/[^\/]*
Any help appreciated
So in the bbc.com/movies example - the string does not start with a forward slash / but I still want to be able to keep the bbc.com part during the match
You can use a sub here to only perform a single regex replacement:
sub('^(/?[^/]*/).*', '\\1', x)
See the regex demo.
Details
^ - start of string
-(/?[^/]*/) - Capturing group 1 (\1 in the replacement pattern): an optional /, then 0 or more chars other than / and then a /
.* - any zero or more chars, as many as possible.
See an R test online:
test <- c("/youtube.com/videos/cats", "/google.com/images/dogs", "bbc.com/movies")
sub('^(/?[^/]*/).*', '\\1', test)
# => [1] "/youtube.com/" "/google.com/" "bbc.com/"
First great username. Try this, you can leverage the fact str_extract only pulls the first match out. assuming all urls match letters.letters this pattern should work. Let me know if you have numbers in any of them.
library(stringr)
c("/youtube.com/videos/cats",
"/google.com/images/dogs",
"bbc.com/movies") %>%
str_extract(., "/?\\w+\\.\\w+/")
produces
"/youtube.com/" "/google.com/" "bbc.com/"
Using base R
gsub('(\\/?.*\\.com\\/).*', '\\1', x)
[1] "/youtube.com/" "/google.com/" "bbc.com/"
an alternative would be with the rebus Package:
library(rebus)
library(stringi)
t <- c("/youtube.com/videos/cats"," /google.com/images/dogs"," bbc.com/movie")
pattern <- zero_or_more("/") %R% one_or_more(ALPHA) %R% DOT %R% one_or_more(ALPHA) %R% zero_or_more("/")
stringi::stri_extract_first_regex(t, pattern)
[1] "/youtube.com/" "/google.com/" "bbc.com/"

R Regex capture group?

I have a lot of strings like this:
2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0
I want to extract the substring that lays right after the last "/" and ends with "_":
556662
I have found out how to extract: /01/01/07/556662
by using the following regex: (\/)(.*?)(?=\_)
Please advise how can I capture the right group.
You may use
x <- "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0"
regmatches(x, regexpr(".*/\\K[^_]+", x, perl=TRUE))
## [1] "556662"
See the regex and R demo.
Here, the regex matches and outputs the first substring that matches
.*/ - any 0+ chars as many as possible up to the last /
\K - omits this part from the match
[^_]+ - puts 1 or more chars other than _ into the match value.
Or, a sub solution:
sub(".*/([^_]+).*", "\\1", x)
See the regex demo.
Here, it is similar to the previous one, but the 1 or more chars other than _ are captured into Group 1 (\1 in the replacement pattern) and the trailing .* make sure the whole input is matched (and consumed, ready to be replaced).
Alternative non-base R solutions
If you can afford or prefer to work with stringi, you may use
library(stringi)
stri_match_last_regex("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0", ".*/([^_]+)")[,2]
## [1] "556662"
This will match a string up to the last / and will capture into Group 1 (that you access in Column 2 using [,2]) 1 or more chars other than _.
Or
stri_extract_last_regex("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0", "(?<=/)[^_/]+")
## => [1] "556662"
This will extract the last match of a string that consists of 1 or more chars other than _ and / after a /.
You could use a capturing group:
/([^_/]+)_[^/\s]*
Explanation
/ Match literally
([^_/]+) Capture in a group matching not an underscore or forward slash
_[^/\s]* Match _ and then 0+ times not a forward slash or a whitespace character
Regex demo | R demo
One option to get the capturing group might be to get the second column using str_match:
library(stringr)
str = c("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0")
str_match(str, "/([^_/]+)_[^/\\s]*")[,2]
# [1] "556662"
I changed the Regex rules according to the code of Wiktor Stribiżew.
x <- "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0"
regmatches(x, regexpr(".*/([0-9]+)", x, perl=TRUE))
sub(".*/([0-9]+).*", "\\1", x)
Output
[1] "2019/01/01/07/556662"
[1] "556662"
R demo

Extract a year number from a string that is surrounded by special characters

What's a good way to extract only the number 2007 from the following string:
some_string <- "1_2_start_2007_3_end"
The pattern to detect the year number in my case would be:
4 digits
surrounded by "_"
I am quite new to using regular expressions. I tried the following:
regexp <- "_+[0-9]+_"
names <- str_extract(files, regexp)
But this does not take into account that there are always 4 digits and outputs the underlines as well.
You may use a sub option, too:
some_string <- "1_2_start_2007_3_end"
sub(".*_(\\d{4})_.*", "\\1", some_string)
See the regex demo
Details
.* - any 0+ chars, as many as possible
_ - a _ char
(\\d{4}) - Group 1 (referred to via \1 from the replacement pattern): 4 digits
_.* - a _ and then any 0+ chars up to the end of string.
NOTE: akrun's str_extract(some_string, "(?<=_)\\d{4}") will extract the leftmost occurrence and my sub(".*_(\\d{4})_.*", "\\1", some_string) will extract the rightmost occurrence of a 4-digit substring enclosed with _. For my my solution to return the leftmost one use a lazy quantifier with the first .: sub(".*?_(\\d{4})_.*", "\\1", some_string).
R test:
some_string <- "1_2018_start_2007_3_end"
sub(".*?_(\\d{4})_.*", "\\1", some_string) # leftmost
## -> 2018
sub(".*_(\\d{4})_.*", "\\1", some_string) # rightmost
## -> 2007
We can use regex lookbehind to specify the _ and extract the 4 digits that follow
library(stringr)
str_extract(some_string, "(?<=_)\\d{4}")
#[1] "2007"
If the pattern also shows - both before and after the 4 digits, then use regex lookahead as well
str_extract(some_string, "(?<=_)\\d{4}(?=_)")
#[1] "2007"
Just to get a non-regex approach out there, in which we split on _ and convert to numeric. All non-numbers will be coerced to NA, so we use !is.na to eliminate them. We then use nchar to count the characters, and pull the one with 4.
i1 <- as.numeric(strsplit(some_string, '_')[[1]])
i1 <- i1[!is.na(i1)]
i1[nchar(i1) == 4]
#[1] 2007
This is the quickest regex I could come up with:
\S.*_(\d{4})_\S.*
It means,
any number of non-space characters,
then _
followed by four digits (d{4})
above four digits is your year captured using ()
another _
any other gibberish non space string
Since, you mentioned you're new, please test this and all other answers at https://regex101.com/, pretty good to learn regex, it explains in depth what your regex is actually doing.
If you just care about (year) then below regex is enough:
_(\d{4})_

Removing the second "|" on the last position

Here are some examples from my data:
a <-c("sp|Q9Y6W5|","sp|Q9HB90|,sp|Q9NQL2|","orf|NCBIAAYI_c_1_1023|",
"orf|NCBIACEN_c_10_906|,orf|NCBIACEO_c_5_1142|",
"orf|NCBIAAYI_c_258|,orf|aot172_c_6_302|,orf|aot180_c_2_405|")
For a: The individual strings can contain even more entries of "sp|" and "orf"
The results have to be like this:
[1] "sp|Q9Y6W5" "sp|Q9HB90,sp|Q9NQL2" "orf|NCBIAAYI_c_1_1023"
"orf|NCBIACEN_c_10_906,orf|NCBIACEO_c_5_1142"
"orf|NCBIAAYI_c_258,orf|aot172_c_6_302,orf|aot180_c_2_405"
So the aim is to remove the last "|" for each "sp|" and "orf|" entry. It seems that "|" is a special challenge because it is a metacharacter in regular expressions. Furthermore, the length and composition of the "orf|" entries varying a lot. The only things they have in common is "orf|" or "sp|" at the beginning and that "|" is on the last position. I tried different things with gsub() but also with the stringr package or regexpr() or [:punct:], but nothing really worked. Maybe it was just the wrong combination.
We can use gsub to match the | that is followed by a , or is at the end ($) of the string and replace with blank ("")
gsub("[|](?=(,|$))", "", a, perl = TRUE)
#[1] "sp|Q9Y6W5"
#[2] "sp|Q9HB90,sp|Q9NQL2"
#[3] "orf|NCBIAAYI_c_1_1023"
#[4] "orf|NCBIACEN_c_10_906,orf|NCBIACEO_c_5_1142"
#[5] "orf|NCBIAAYI_c_258,orf|aot172_c_6_302,orf|aot180_c_2_405"
Or we split by ,', remove the last character withsubstr, andpastethelist` elements together
sapply(strsplit(a, ","), function(x) paste(substr(x, 1, nchar(x)-1), collapse=","))
An easy alternative that might work. You need to escape the "|" using "\\|".
# Input
a <-c("sp|Q9Y6W5|","sp|Q9HB90|,sp|Q9NQL2|","orf|NCBIAAYI_c_1_1023|",
"orf|NCBIACEN_c_10_906|,orf|NCBIACEO_c_5_1142|",
"orf|NCBIAAYI_c_258|,orf|aot172_c_6_302|,orf|aot180_c_2_405|")
# Expected output
b <- c("sp|Q9Y6W5", "sp|Q9HB90,sp|Q9NQL2", "orf|NCBIAAYI_c_1_1023" ,
"orf|NCBIACEN_c_10_906,orf|NCBIACEO_c_5_1142" ,
"orf|NCBIAAYI_c_258,orf|aot172_c_6_302,orf|aot180_c_2_405")
res <- gsub("\\|,", ",", gsub("\\|$", "", a))
all(res == b)
#[1] TRUE
You could construct a single regex call to gsub, but this is simple and easy to understand. The inner gsub looks for | and the end of the string and removes it. The outer gsub looks for ,| and replaces with ,.
You do not have to use a PCRE regex here as all you need can be done with the default TRE regex (if you specify perl=TRUE, the pattern is compiled with a PCRE regex engine and is sometimes slower than TRE default regex engine).
Here is the single simple gsub call:
gsub("\\|(,|$)", "\\1", a)
See the online R demo. No lookarounds are really necessary, as you see.
Pattern details
\\| - a literal | symbol (because if you do not escape it or put into a bracket expression it will denote an alternation operator, see the line below)
(,|$) - a capturing group (referenced to with \1 from the replacement pattern) matching either of the two alternatives:
, - a comma
| - or (the alternation operator)
$ - end of string anchor.
The \1 in the replacement string tells the regex engine to insert the contents stored in the capturing group #1 back into the resulting string (so, the commas are restored that way where necessary).

Resources