How to extract everything until first occurrence of pattern - r

I'm trying to use the stringr package in R to extract everything from a string up until the first occurrence of an underscore.
What I've tried
str_extract("L0_123_abc", ".+?(?<=_)")
> "L0_"
Close but no cigar. How do I get this one? Also, Ideally I'd like something that's easy to extend so that I can get the information in between the 1st and 2nd underscore and get the information after the 2nd underscore.

To get L0, you may use
> library(stringr)
> str_extract("L0_123_abc", "[^_]+")
[1] "L0"
The [^_]+ matches 1 or more chars other than _.
Also, you may split the string with _:
x <- str_split("L0_123_abc", fixed("_"))
> x
[[1]]
[1] "L0" "123" "abc"
This way, you will have all the substrings you need.
The same can be achieved with
> str_extract_all("L0_123_abc", "[^_]+")
[[1]]
[1] "L0" "123" "abc"

The regex lookaround should be
str_extract("L0_123_abc", ".+?(?=_)")
#[1] "L0"

Using gsub...
gsub("(.+?)(\\_.*)", "\\1", "L0_123_abc")

You can use sub from base using _.* taking everything starting from _.
sub("_.*", "", "L0_123_abc")
#[1] "L0"
Or using [^_] what is everything but not _.
sub("([^_]*).*", "\\1", "L0_123_abc")
#[1] "L0"
or using substr with regexpr.
substr("L0_123_abc", 1, regexpr("_", "L0_123_abc")-1)
#substr("L0_123_abc", 1, regexpr("_", "L0_123_abc", fixed=TRUE)-1) #More performant alternative
#[1] "L0"

Related

Get substring before the second capital letter

Is there an R function to get only the part of a string before the 2nd capital character appears?
For example:
Example <- "MonkeysDogsCats"
Expected output should be:
"Monkeys"
Maybe something like
stringr::str_extract("MonkeysDogsCats", "[A-Z][a-z]*")
#[1] "Monkeys"
Here is an alternative approach:
Here we first put a space before all uppercase and then extract the first word:
library(stringr)
word(gsub("([a-z])([A-Z])","\\1 \\2", Example), 1)
[1] "Monkeys"
A base solution with sub():
x <- "MonkeysDogsCats"
sub("(?<=[a-z])[A-Z].*", "", x, perl = TRUE)
# [1] "Monkeys"
Another way using stringr::word():
stringr::word(x, 1, sep = "(?=[A-Z])\\B")
# [1] "Monkeys"
If the goal is strictly to capture any string before the 2nd capital character, one might want pick a solution it'll also work with all types of strings including numbers and special characters.
strings <- c("MonkeysDogsCats",
"M4DogsCats",
"M?DogsCats")
stringr::str_remove(strings, "(?<=.)[A-Z].*")
Output:
[1] "Monkeys" "M4" "M?"
It depends on what you want to allow to match. You can for example match an uppercase char [A-Z] optionally followed by any character that is not an uppercase character [^A-Z]*
If you don't want to allow whitespace chars, you can exclude them [^A-Z\\s]*
library(stringr)
str_extract("MonkeysDogsCats", "[A-Z][^A-Z]*")
Output
[1] "Monkeys"
R demo
If there should be an uppercase character following, and there are only lowercase characters allowed:
str <- "MonkeysDogsCats"
regmatches(str, regexpr("[A-Z][a-z]*(?=[A-Z])", str, perl = TRUE))
Output
[1] "Monkeys"
R demo

Is it possible to use R's base::strsplit() without consuming pattern

I have a string that consists entirely of simple repeating patterns of a [:digit:]+[A-Z] for instance 12A432B4B.
I want to to use base::strsplit() to get:
[1] "12A" "432B" "4B"
I thought I could use lookahead to split by a LETTER and keep this pattern with unlist(strsplit("12A432B4B", "(?<=.)(?=[A-Z])", perl = TRUE)) but as can be seen I get the split wrongly:
[1] "12" "A432" "B4" "B"
Cant get my mind around a pattern that works with this strsplit strategy? Explanations would be really appreciated.
Bonus:
I also failed to use back reference in gsub (e.g. - pattern not working `gsub("([[:digit:]]+[A-Z])+", "\\1", "12A432B4B"), and can you retrieve more than \\1 to \\9 groups, say if [:digit:]+[A-Z] repeats for more than 9 times ?
We can use regex lookaround to split between an upper case letter and a digit
strsplit(str1, "(?<=[A-Z])(?=[0-9])", perl = TRUE)[[1]]
#[1] "12A" "432B" "4B"
data
str1 <- "12A432B4B"
The pattern mentioned in the post can be used as it is in str_extract_all :
str_extract_all(string, '[[:digit:]]+[A-Z]')[[1]]
#[1] "12A" "432B" "4B"
Or in base R :
regmatches(string, gregexpr('[[:digit:]]+[A-Z]', string))[[1]]
where string is :
string <- '12A432B4B'

Extract substring in R using grepl

I have a table with a string column formatted like this
abcdWorkstart.csv
abcdWorkcomplete.csv
And I would like to extract the last word in that filename. So I think the beginning pattern would be the word "Work" and ending pattern would be ".csv". I wrote something using grepl but not working.
grepl("Work{*}.csv", data$filename)
Basically I want to extract whatever between Work and .csv
desired outcome:
start
complete
I think you need sub or gsub (substitute/extract) instead of grepl (find if match exists). Note that when not found, it will return the entire string unmodified:
fn <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
out <- sub(".*Work(.*)\\.csv$", "\\1", fn)
out
# [1] "start" "complete" "abcdNothing.csv"
You can work around this by filtering out the unchanged ones:
out[ out != fn ]
# [1] "start" "complete"
Or marking them invalid with NA (or something else):
out[ out == fn ] <- NA
out
# [1] "start" "complete" NA
With str_extract from stringr. This uses positive lookarounds to match any character one or more times (.+) between "Work" and ".csv":
x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")
library(stringr)
str_extract(x, "(?<=Work).+(?=\\.csv)")
# [1] "start" "complete"
Just as an alternative way, remove everything you don't want.
x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")
gsub("^.*Work|\\.csv$", "", x)
#[1] "start" "complete"
please note:
I have to use gsub. Because I first remove ^.*Work then \\.csv$.
For [\\s\\S] or \\d\\D ... (does not work with [g]?sub)
https://regex101.com/r/wFgkgG/1
Works with akruns approach:
regmatches(v1, regexpr("(?<=Work)[\\s\\S]+(?=[.]csv)", v1, perl = T))
str1<-
'12
.2
12'
gsub("[^.]","m",str1,perl=T)
gsub(".","m",str1,perl=T)
gsub(".","m",str1,perl=F)
. matches also \n when using the R engine.
Here is an option using regmatches/regexpr from base R. Using a regex lookaround to match all characters that are not a . after the string 'Work', extract with regmatches
regmatches(v1, regexpr("(?<=Work)[^.]+(?=[.]csv)", v1, perl = TRUE))
#[1] "start" "complete"
data
v1 <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')

Use gsub to keep only the first part of my string

In R, I have strings looking like:
test <- 'ZYG11B|79699'
I want to keep only 'ZYG11B'.
My best attempt yet:
gsub ("|.*$", "", test) # should replace everything after '|' by nothing
but returns
> [1] ""
How should I do that?
It's a protected character which means it should be enclosed in square brackets or escaped with double slashes:
> gsub('[|].*$','', test)
[1] "ZYG11B"
> gsub('\\|.*$','', test)
[1] "ZYG11B"
We can do
library(stringr)
str_extract(test, "\\w+")
#[1] "ZYG11B"

Extract characters between specified characters in R

I have this variable
x= "379_exp_mirror1.csv"
I need to extract the number ("379") at the beggining (which doesn't always have 3 characters), i.e. everything before the first "". And then I need to extract everything between the second "" and the ".", in this case "mirror1".
I have tried several combinations with sub and gsub with no success, can anyone give me some indications please?
Thank you
You can use regular expression. For your problem ^(?<Number>[0-9]*)_.* do the job
1/ Test your regular expression with this website : http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
Or you can split string with underscore and then try parse (int.TryParse). I think the second is better but if you want to be a regular expression master try the first method
You can use sub to extract the substrings:
x <- "379_exp_mirror1.csv"
sub("_.*", "", x)
# [1] "379"
sub("^(?:.*_){2}(.*?)\\..*", "\\1", x)
# [1] "mirror1"
Another approach with gregexpr:
regmatches(x, gregexpr("^.*?(?=_)|(?<=_)[^_]*?(?=\\.)", x, perl = TRUE))[[1]]
# [1] "379" "mirror1"
May be you can try:
library(stringr)
x <- "379_exp_mirror1.csv"
str_extract_all(x, perl('^[0-9]+(?=_)|[[:alnum:]]+(?=\\.)'))[[1]]
#[1] "379" "mirror1"
Or
strsplit(x, "[._]")[[1]][c(T,F)]
#[1] "379" "mirror1"
Or
scan(text=gsub("[.]","_", x),what="",sep="_")[c(T,F)]
#Read 4 items
#[1] "379" "mirror1"

Resources