stringr str_extract capture group capturing everything - r

I'm looking to extract the year from a string. This always comes after an 'X' and before "." then a string of other characters.
Using stringr's str_extract I'm trying the following:
year = str_extract(string = 'X2015.XML.Outgoing.pounds..millions.'
, pattern = 'X(\\d{4})\\.')
I thought the brackets would define the capture group, returning 2015, but I actually get the complete match, X2015. (trailing dot included).
Am I doing this correctly? Why am I not trimming "X" and "."?

The capture group is irrelevant in this case. The function str_extract will return the whole match including characters before and after the capture group.
You have to use a lookbehind and a lookahead instead. Both are zero-width assertions: they check the surrounding characters without adding them to the match.
library(stringr)
str_extract(string = 'X2015.XML.Outgoing.pounds..millions.',
pattern = '(?<=X)\\d{4}(?=\\.)')
# [1] "2015"
This regex matches four consecutive digits that are preceded by an X and followed by a literal dot (.).
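For readers without stringr, the same lookaround pattern can be sketched in base R. This is only an illustration: the second column name is made up, and `perl = TRUE` is needed because base R's default regex engine does not support lookbehind.

```r
# Hypothetical input: the second name is invented for illustration.
cols <- c("X2015.XML.Outgoing.pounds..millions.",
          "X2016.XML.Outgoing.pounds..millions.")

# regexpr() locates the first match per element; perl = TRUE enables lookarounds.
m <- regexpr("(?<=X)\\d{4}(?=\\.)", cols, perl = TRUE)

# regmatches() returns only the matched text, i.e. the four digits.
years <- regmatches(cols, m)
print(years)
```

Note that `regmatches()` silently drops elements with no match, so the result can be shorter than the input.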

I believe the most idiomatic way is to use str_match:
str_match(string = 'X2015.XML.Outgoing.pounds..millions.',
pattern = 'X(\\d{4})\\.')
Which returns the complete match followed by capture groups:
[,1] [,2]
[1,] "X2015." "2015"
As such the following will do the trick:
str_match(string = 'X2015.XML.Outgoing.pounds..millions.',
pattern = 'X(\\d{4})\\.')[, 2]
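A rough base R counterpart to the capture-group approach, assuming you want one year per input element, is `regexec()` plus `regmatches()`. The extra inputs below are hypothetical; the point is that the capture group is always the second element of each result, and that non-matching inputs yield NA.

```r
# Hypothetical inputs: only the first string comes from the question.
cols <- c("X2015.XML.Outgoing.pounds..millions.",
          "X2016.XML.Outgoing.pounds..millions.",
          "Total")

# regexec() records the full match and every capture group.
hits <- regmatches(cols, regexec("X(\\d{4})\\.", cols))

# Element 1 is the full match, element 2 the first capture group.
# Indexing an empty result with [2] gives NA for non-matching inputs.
years <- vapply(hits, function(h) h[2], character(1), USE.NAMES = FALSE)
print(years)
```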

Alternatively, you can use gsub:
string = 'X2015.XML.Outgoing.pounds..millions.'
gsub("X(\\d{4})\\..*", "\\1", string)
# [1] "2015"
or str_replace from stringr:
library(stringr)
str_replace(string, "X(\\d{4})\\..*", "\\1")
# [1] "2015"
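One caveat with the replacement approach, sketched below in base R: when the pattern does not match, `sub()` returns the input unchanged rather than NA. The second input and the guard with `grepl()` are illustrative assumptions, not part of the original answer.

```r
# Hypothetical inputs: "Total" stands in for any string without a year.
x <- c("X2015.XML.Outgoing.pounds..millions.", "Total")
pattern <- "X(\\d{4})\\..*"

# Non-matching strings pass through sub() untouched.
replaced <- sub(pattern, "\\1", x)

# Guard: keep the replacement only where the pattern actually matched.
guarded <- ifelse(grepl(pattern, x), replaced, NA_character_)

print(replaced)
print(guarded)
```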

Related

Convert sign in column names if not at certain position in R [duplicate]

I have a character string of names which look like
"_6302_I-PAL_SPSY_000237_001"
I need to remove the first occurred underscore, so that it will be as
"6302_I-PAL_SPSY_000237_001"
I am aware of gsub, but it removes all of the underscores. Thank you for any suggestions.
gsub can do this as well: anchor the pattern with ^ so that only an underscore at the start of the string is removed.
x <- "_6302_I-PAL_SPSY_000237_001"
x <- gsub("^\\_","",x)
[1] "6302_I-PAL_SPSY_000237_001"
We can use sub with pattern as _ and replacement as blanks (""). This will remove the first occurrence of '_'.
sub("_", "", str1)
#[1] "6302_I-PAL_SPSY_000237_001"
NOTE: This removes the first occurrence of _ wherever it appears; it is not restricted to the start of the string.
For example, suppose we have string
str2 <- "6302_I-PAL_SPSY_000237_001"
sub("_", "", str2)
#[1] "6302I-PAL_SPSY_000237_001"
As the example has _ at the beginning, another option is substring
substring(str1, 2)
#[1] "6302_I-PAL_SPSY_000237_001"
data
str1 <- "_6302_I-PAL_SPSY_000237_001"
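To make the difference between the unanchored and anchored patterns concrete, here is a small side-by-side sketch. The second input (without a leading underscore) is taken from the note above; the variable names are illustrative.

```r
# One name with a leading underscore, one without.
names_in <- c("_6302_I-PAL_SPSY_000237_001", "6302_I-PAL_SPSY_000237_001")

# Unanchored: removes the first underscore wherever it occurs.
unanchored <- sub("_", "", names_in)

# Anchored: removes an underscore only at the start of the string.
anchored <- sub("^_", "", names_in)

print(unanchored)
print(anchored)
```

The anchored version leaves the second name intact, which is usually what is wanted when cleaning a whole column of names.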
This can be done with base R's trimws() too
string1<-"_6302_I-PAL_SPSY_000237_001"
trimws(string1, which='left', whitespace = '_')
[1] "6302_I-PAL_SPSY_000237_001"
In case we have multiple words with leading underscores, we may have to include a word boundary (\\b) in our regex, and use either gsub or stringr::str_remove_all:
string2<-paste(string1, string1)
string2
[1] "_6302_I-PAL_SPSY_000237_001 _6302_I-PAL_SPSY_000237_001"
library(stringr)
str_remove_all(string2, "\\b_")
[1] "6302_I-PAL_SPSY_000237_001 6302_I-PAL_SPSY_000237_001"

R how to match and extract character letters of different length in a string

So I have a column of contract names df$name like below
FB210618C00280000
ADM210618C00280000
M210618P00280000
I would like to extract the FB, ADM and M. That is I want to extract characters in the string and they are of different length and stop once the first number occurs, and I don't want to extract the C or P.
The below code will give me the C or P
stri_extract_all_regex(df$name, "[a-z]+")
We can use stri_extract_first from stringi
library(stringi)
stri_extract_first(df$name, regex = "[A-Z]+")
#[1] "FB" "ADM" "M"
Or we can use base R with sub
sub("\\d+.*", "", df$name)
#[1] "FB" "ADM" "M"
Or use trimws from base R
trimws(df$name, whitespace = "\\d+.*")
data
df <- data.frame(name = c("FB210618C00280000", "ADM210618C00280000",
"M210618P00280000"))
You can use
library(stringr)
str_extract(df$name, "^[A-Za-z]+")
# Or
str_extract(df$name, "^\\p{L}+")
The stringr::str_extract function extracts the first occurrence of a pattern, and the ^[A-Za-z]+ / ^\p{L}+ regex matches one or more letters at the start of the string. Note that \p{L} matches any Unicode letter.
Same pattern can be used with stringi::stri_extract_first():
library(stringi)
stri_extract_first(df$name, regex="^[A-Za-z]+")
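If the other parts of the contract name are needed too, one possible base R sketch is to capture every field at once. The layout assumed here (letters, a six-digit date, a C/P flag, then digits) is inferred from the three sample values only and may not hold for all contract names.

```r
contracts <- c("FB210618C00280000", "ADM210618C00280000", "M210618P00280000")

# Assumed layout: ticker letters, six-digit date, C or P, remaining digits.
pattern <- "^([A-Za-z]+)(\\d{6})([CP])(\\d+)$"

# Each row holds the full match followed by the four capture groups.
parts <- do.call(rbind, regmatches(contracts, regexec(pattern, contracts)))

tickers <- parts[, 2]
flags <- parts[, 4]
print(tickers)
print(flags)
```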

How to extract string after 2nd delimiter in R

I have my vector as
dt <- c("1:7984985:A:G", "1:7984985-7984985:A:G", "1:7984985-7984985:T:G")
I would like to extract everything after 2nd :.
The result I would like is
A:G, A:G, T:G
What would be the solution for this?
We can use sub to match two instances of one or more characters that are not a : ([^:]+) followed by : from the start (^) of the string and replace it with blank ("")
sub("^([^:]+:){2}", "", dt)
#[1] "A:G" "A:G" "T:G"
It can also be done with trimws, if you would rather strip by character class than by position: this removes every leading or trailing digit, hyphen and colon
trimws(dt, whitespace = "[-0-9:]")
#[1] "A:G" "A:G" "T:G"
Or using str_remove from stringr
library(stringr)
str_remove(dt, "^([^:]+:){2}")
#[1] "A:G" "A:G" "T:G"
You can use sub, capture the items you want to retain in a capturing group (...) and refer back to them in the replacement argument to sub:
sub("^.:[^:]+:(.:.)", "\\1", dt, perl = TRUE)
[1] "A:G" "A:G" "T:G"
Alternatively, you can use str_extract and positive lookbehind (?<=...):
library(stringr)
str_extract(dt, "(?<=:)[A-Z]:[A-Z]")
[1] "A:G" "A:G" "T:G"
Or simply use str_split with n = 3, which splits at the first two colons only, so the third piece holds everything after the 2nd delimiter.
str_split("1:7984985:A:G", ":", n = 3)[[1]][3]
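The repeated-group pattern generalises to "everything after the n-th delimiter". The helper below, `after_nth()`, is a hypothetical name introduced for illustration; it assumes the delimiter is a single character with no special meaning inside a bracket expression.

```r
# Hypothetical helper: drop everything up to and including the n-th delimiter.
# Assumes `delim` is one ordinary character such as ":" or "_".
after_nth <- function(x, n, delim = ":") {
  # Builds e.g. "^(?:[^:]*:){2}" for n = 2 and delim = ":".
  pattern <- sprintf("^(?:[^%s]*%s){%d}", delim, delim, n)
  sub(pattern, "", x, perl = TRUE)
}

dt <- c("1:7984985:A:G", "1:7984985-7984985:A:G", "1:7984985-7984985:T:G")
after_second <- after_nth(dt, 2)
after_first <- after_nth(dt, 1)
print(after_second)
print(after_first)
```

Strings with fewer than n delimiters do not match and are returned unchanged.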

String replace with regex condition

I have a pattern that I want to match and replace with an X. However, I only want the pattern to be replaced if the preceding character is either an A or a B, or if there is no preceding character (beginning of string).
I know how to replace patterns using the str_replace_all function but I don't know how I can add this additional condition. I use the following code:
library(stringr)
string <- "0000A0000B0000C0000D0000E0000A0000"
pattern <- "0000"
replacement <- str_replace_all(string, pattern, "XXXX")
Result:
[1] "XXXXAXXXXBXXXXCXXXXDXXXXEXXXXAXXXX"
Desired result:
Replacement only when the preceding character is A, B or no character:
[1] "XXXXAXXXXBXXXXC0000D0000E0000AXXXX"
You may use
gsub("(^|[AB])0000", "\\1XXXX", string)
Details
(^|[AB]) - Capturing group 1 (\1): start of string (^) or (|) A or B ([AB])
0000 - four zeros.
R demo:
string <- "0000A0000B0000C0000D0000E0000A0000"
pattern <- c("XXXX")
gsub("(^|[AB])0000", "\\1XXXX", string)
## -> [1] "XXXXAXXXXBXXXXC0000D0000E0000AXXXX"
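As a cross-check, the same condition can also be written without a capture group by using a negative lookbehind on "a character that is not A or B". This is an alternative sketch, not the approach from the answer above; it needs `perl = TRUE`.

```r
string <- "0000A0000B0000C0000D0000E0000A0000"

# Capture-group version: re-insert the matched prefix with \\1.
with_group <- gsub("(^|[AB])0000", "\\1XXXX", string)

# Lookbehind version: fail if the previous character exists and is not A or B.
# At the start of the string there is no previous character, so it succeeds.
with_lookbehind <- gsub("(?<![^AB])0000", "XXXX", string, perl = TRUE)

print(with_group)
print(with_lookbehind)
```

Both calls should give the same result on this input; the lookbehind form avoids the back-reference in the replacement.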
You could also try the following, which uses a positive lookahead.
string <- "0000A0000B0000C0000D0000E0000A0000"
gsub(x = string, pattern = "(^|A|B)(?=0000)((?i)0000?)",
replacement = "\\1xxxx", perl=TRUE)
Output will be as follows.
[1] "xxxxAxxxxBxxxxC0000D0000E0000Axxxx"
Thanks to Wiktor Stribiżew for the answer! It also works with the stringr package:
library(stringr)
string <- "0000A0000B0000C0000D0000E0000A0000"
pattern <- c("0000")
replace <- str_replace_all(string, paste0("(^|[AB])",pattern), "\\1XXXX")
replace
[1] "XXXXAXXXXBXXXXC0000D0000E0000AXXXX"

How to extract everything until first occurrence of pattern

I'm trying to use the stringr package in R to extract everything from a string up until the first occurrence of an underscore.
What I've tried
str_extract("L0_123_abc", ".+?(?<=_)")
> "L0_"
Close but no cigar. How do I get this one? Also, ideally I'd like something that's easy to extend, so that I can get the information between the 1st and 2nd underscore and the information after the 2nd underscore.
To get L0, you may use
> library(stringr)
> str_extract("L0_123_abc", "[^_]+")
[1] "L0"
The [^_]+ matches 1 or more chars other than _.
Also, you may split the string with _:
x <- str_split("L0_123_abc", fixed("_"))
> x
[[1]]
[1] "L0" "123" "abc"
This way, you will have all the substrings you need.
The same can be achieved with
> str_extract_all("L0_123_abc", "[^_]+")
[[1]]
[1] "L0" "123" "abc"
The regex lookaround should be
str_extract("L0_123_abc", ".+?(?=_)")
#[1] "L0"
Using gsub...
gsub("(.+?)(\\_.*)", "\\1", "L0_123_abc")
You can use sub from base R with the pattern _.*, which matches everything from the first _ to the end of the string and removes it.
sub("_.*", "", "L0_123_abc")
#[1] "L0"
Or use [^_], which matches any character except _.
sub("([^_]*).*", "\\1", "L0_123_abc")
#[1] "L0"
or using substr with regexpr.
substr("L0_123_abc", 1, regexpr("_", "L0_123_abc")-1)
#substr("L0_123_abc", 1, regexpr("_", "L0_123_abc", fixed=TRUE)-1) #More performant alternative
#[1] "L0"
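For the "easy to extend" part of the question, a base R sketch with `strsplit()` gives all three fields for a whole vector at once. The second ID is hypothetical, and the code assumes every string has exactly three underscore-separated fields.

```r
# Hypothetical input: the second ID is invented for illustration.
ids <- c("L0_123_abc", "L1_456_def")

# fixed = TRUE treats "_" as a literal string rather than a regex.
parts <- strsplit(ids, "_", fixed = TRUE)

# Pull the k-th field from every element.
field <- function(k) vapply(parts, function(p) p[k], character(1))

before_first <- field(1)
between <- field(2)
after_second <- field(3)
print(before_first)
print(between)
print(after_second)
```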
