Return a string between two characters '.' - r

I have column names similar to the following
names(df_woe)
# [1] "A_FLAG" "woe.ABCD.binned" "woe.EFGHIJ.binned"
...
I would like to rename the columns by removing the "woe." and ".binned" sections, so that the following will be returned
names(df_woe)
# [1] "A_FLAG" "ABCD" "EFGHIJ"
...
I have tried substr(names(df_woe), start, stop) but I am unsure how to set variable start/stop arguments.
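For the substr route, the start and stop arguments can be computed from the positions of the dots. The following is a minimal base R sketch, assuming the part to keep always sits between the first and second dot; between_dots is a hypothetical helper name and nam is sample data, not the real df_woe.

```r
# Sample column names (assumed for illustration)
nam <- c("A_FLAG", "woe.ABCD.binned", "woe.EFGHIJ.binned")

# Hypothetical helper: text between the first and second literal dot
between_dots <- function(s) {
  pos <- gregexpr(".", s, fixed = TRUE)  # positions of every "." per string
  mapply(function(str, p) {
    if (length(p) < 2) return(str)       # fewer than two dots: keep the name as-is
    substr(str, p[1] + 1, p[2] - 1)      # start after 1st dot, stop before 2nd
  }, s, pos, USE.NAMES = FALSE)
}

new_names <- between_dots(nam)
new_names
```

The regex answers below are shorter, but this shows where variable start/stop values would come from.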

Another readable regex option is to create capture groups and return the group between the first and the second dot, i.e.
gsub("(.*\\.)(.*)\\..+", "\\2", names(df_woe))
#[1] "A_FLAG" "ABCD" "EFGHIJ"

nam <- c("A_FLAG", "woe.ABCD.binned", "woe.EFGH.binned")
gsub("woe\\.|\\.binned", "", nam)
[1] "A_FLAG" "ABCD" "EFGH"
EDIT: a solution that deals with weirder cases such as woe..binned.binned
gsub("^woe\\.|\\.binned$", "", nam)

Another solution, using the stringr package (the dots must be escaped, otherwise . matches any character):
library(stringr)
str_replace_all("woe.ABCD.binned", pattern = "woe\\.|\\.binned", replacement = "")
# [1] "ABCD"
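To apply the anchored pattern to the actual column names, the result can be assigned back with names<-. A minimal sketch, assuming a made-up data frame that only mimics the column names from the question:

```r
# Hypothetical data frame with the column names from the question
df_woe <- data.frame(a = 1:2, b = 3:4, c = 5:6)
names(df_woe) <- c("A_FLAG", "woe.ABCD.binned", "woe.EFGHIJ.binned")

# Strip a leading "woe." and a trailing ".binned"; other names are left alone
names(df_woe) <- gsub("^woe\\.|\\.binned$", "", names(df_woe))
names(df_woe)
# [1] "A_FLAG" "ABCD"   "EFGHIJ"
```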

Related

Extract a string that spans across multiple lines - stringr

I need to extract a string that spans across multiple lines on an object.
The object:
> text <- paste("abc \nd \ne")
> cat(text)
abc
d
e
With str_extract_all I can extract all the text between ‘a’ and ‘c’, for example.
> str_extract_all(text, "a.*c")
[[1]]
[1] "abc"
Using the function ‘regex’ and the argument ‘multiline’ set to TRUE, I can extract a string across multiple lines. In this case, I can extract the first character of multiple lines.
> str_extract_all(text, regex("^."))
[[1]]
[1] "a"
> str_extract_all(text, regex("^.", multiline = TRUE))
[[1]]
[1] "a" "d" "e"
But when I try to extract "every character between a and d" (a regex that spans across multiple lines), the output is "character(0)".
> str_extract_all(text, regex("a.*d", multiline = TRUE))
[[1]]
character(0)
The desired output is:
“abcd”
How to get it with stringr?
dplyr:
library(dplyr)
library(stringr)
data.frame(text) %>%
mutate(new = lapply(str_extract_all(text, "(?!e)\\w"), paste0, collapse = ""))
text new
1 abc \nd \ne abcd
Here we use the character class \\w, which does not include the new line metacharacter \n. The negative lookahead (?!e) makes sure the e is not matched.
Without dplyr (str_extract_all still comes from stringr):
unlist(lapply(str_extract_all(text, "(?!e)\\w"), paste0, collapse = ""))
[1] "abcd"
str_remove_all(text,"\\s\\ne?")
[1] "abcd"
OR
paste0(trimws(strsplit(text, "\\ne?")[[1]]), collapse="")
[1] "abcd"
The answers above remove line breaks. So, a two-step approach can work to get the desired output 'abcd'.
1 - Use str_remove_all or gsub to remove the line breaks (in this case, also removing blank spaces).
2 - Use str_extract_all to get the desired output ('abcd' in this case).
> text %>%
+ str_remove_all("\\s\\n") %>%
+ str_extract_all("a.*d")
[[1]]
[1] "abcd"
Short regex reference:
\n - new line (return)
\s - any whitespace
\r - carriage return
Update:
In base R to get the desired output abcd:
text <- gsub("[\r\n]|[[:blank:]]", "", text)
substr(text,1, nchar(text)-1)
[1] "abcd"
First answer:
We can use gsub:
gsub("[\r\n]|[[:blank:]]", "", text)
[1] "abcde"
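As for why a.*d returned character(0) in the question: multiline = TRUE only changes where ^ and $ match; it does not let . cross a line break. stringr's regex() has a separate dotall argument for that, and the inline flag (?s) has the same effect. A minimal base R sketch using (?s) with perl = TRUE follows; note that the raw match still contains the space and the newline, so they are stripped in a second step.

```r
text <- "abc \nd \ne"

# (?s) lets "." match line breaks as well
m <- regmatches(text, regexpr("(?s)a.*d", text, perl = TRUE))
m
# [1] "abc \nd"

# Remove whitespace (spaces and newlines) from the match
result <- gsub("[[:space:]]+", "", m)
result
# [1] "abcd"
```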

Extract substring in R using grepl

I have a table with a string column formatted like this
abcdWorkstart.csv
abcdWorkcomplete.csv
And I would like to extract the last word in that filename. I think the beginning pattern would be the word "Work" and the ending pattern would be ".csv". I wrote something using grepl, but it is not working.
grepl("Work{*}.csv", data$filename)
Basically I want to extract whatever between Work and .csv
desired outcome:
start
complete
I think you need sub or gsub (substitute/extract) instead of grepl (find if match exists). Note that when not found, it will return the entire string unmodified:
fn <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
out <- sub(".*Work(.*)\\.csv$", "\\1", fn)
out
# [1] "start" "complete" "abcdNothing.csv"
You can work around this by filtering out the unchanged ones:
out[ out != fn ]
# [1] "start" "complete"
Or marking them invalid with NA (or something else):
out[ out == fn ] <- NA
out
# [1] "start" "complete" NA
With str_extract from stringr. This uses positive lookarounds to match any character one or more times (.+) between "Work" and ".csv":
x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")
library(stringr)
str_extract(x, "(?<=Work).+(?=\\.csv)")
# [1] "start" "complete"
Just as an alternative way, remove everything you don't want.
x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")
gsub("^.*Work|\\.csv$", "", x)
#[1] "start" "complete"
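Please note:
gsub (rather than sub) is needed here, because two separate matches are removed: first ^.*Work, then \\.csv$.
Patterns such as [\\s\\S] or [\\d\\D] do not work with [g]?sub as used above (without perl = TRUE):
https://regex101.com/r/wFgkgG/1
They do work with akrun's approach, which sets perl = TRUE:
regmatches(v1, regexpr("(?<=Work)[\\s\\S]+(?=[.]csv)", v1, perl = TRUE))
The two engines also treat . differently:
str1 <- "12\n.2\n12"
gsub("[^.]", "m", str1, perl = TRUE)
# [1] "mmm.mmmm"
gsub(".", "m", str1, perl = TRUE)
# [1] "mm\nmm\nmm"
gsub(".", "m", str1, perl = FALSE)
# [1] "mmmmmmmm"
With perl = FALSE (R's default engine), . also matches \n; with perl = TRUE it does not.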
Here is an option using regmatches/regexpr from base R. A regex lookaround matches one or more characters that are not a . between 'Work' and '.csv', and regmatches extracts the match:
regmatches(v1, regexpr("(?<=Work)[^.]+(?=[.]csv)", v1, perl = TRUE))
#[1] "start" "complete"
data
v1 <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
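Note that regmatches() combined with regexpr() drops elements that have no match, which is why the result above has two elements for three inputs. If the output has to stay aligned with the input (for example as a new data frame column), one possible sketch is to start from a vector of NA and assign the matches back by position:

```r
v1 <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv", "abcdNothing.csv")

m   <- regexpr("(?<=Work)[^.]+(?=[.]csv)", v1, perl = TRUE)
hit <- as.integer(m) != -1L              # TRUE where the pattern matched

out <- rep(NA_character_, length(v1))    # one slot per input element
out[hit] <- regmatches(v1, m)            # fill only the positions that matched
out
# [1] "start"    "complete" NA
```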

String between first two (.dots)

I have data which contains two or more dots. My requirement is to get the string between the first and the second dot.
E.g. string <- "abcd.vdgd.dhdsg"
Expected result: vdgd
I have used
pt <-strapply(string, "\\.(.*)\\.", simplify = TRUE)
which gives the correct result, but for a string with more than two dots it does not work as expected.
E.g. string <- "abcd.vdgd.dhdsg.jsgs"
It gives dhdsg.jsgs, but the expected result is vdgd.
Could anyone help me?
Thanks & Regards,
In base R we can use strsplit
ss <- "abcd.vdgd.dhdsg"
unlist(strsplit(ss, "\\."))[2]
#[1] "vdgd"
Or using gregexpr with regmatches
unlist(regmatches(ss, gregexpr("[^\\.]+", ss)))[2]
#[1] "vdgd"
Or using gsub (thanks @TCZhang)
gsub("^.+?\\.(.+?)\\..*$", "\\1", ss)
#[1] "vdgd"
Another option:
string <- "abcd.vdgd.dhdsg.jsgs"
library(stringr)
str_extract(string = string, pattern = "(?<=\\.).*?(?=\\.)")
[1] "vdgd"
I like this one because the str_extract function will return the first instance of the correct pattern, but you could also use str_extract_all to get all instances.
str_extract_all(string = string, pattern = "(?<=\\.).*?(?=\\.)")
[[1]]
[1] "vdgd" "dhdsg"
From here, you could index to get any position between two dots you want.
Another solution with the qdapRegex package:
library(qdapRegex)
ex_between("abcd.vdgd.dhdsg.jsgs", ".", ".")[[1]][1]
# "vdgd"
You can use read.table as well if you wish. Pass the string as the text argument and set the separator to a dot ("."). Once the string has been read into a data.frame, you can select whichever column you need (in this case, column number 2).
read.table(text=string, sep=".",stringsAsFactors = FALSE)[,2]
Output:
> read.table(text=string, sep=".",stringsAsFactors = FALSE)[,2]
[1] "vdgd"
Here is a fun easy way via stringr
stringr::word(string, 2, sep = '\\.')
Here are two options that are vectorized over the input string vector:
You can try tstrsplit from data.table, which is vectorized over string:
> string <- c("abcd.vdgd.dhdsg", "abcd.vdgd.dhdsg.jsgs")
> tstrsplit(string, '.', fixed = TRUE)[[2]]
[1] "vdgd" "vdgd"
or regex:
> sub('.*?\\.(.*?)\\..*', '\\1', string)
[1] "vdgd" "vdgd"
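The splitting idea can be wrapped into a small helper that returns NA when a string has fewer fields than requested. This is only a sketch; nth_field is a hypothetical name, not a function from any package.

```r
# Hypothetical helper: n-th field of each string when split on a literal separator
nth_field <- function(x, n, sep = ".") {
  parts <- strsplit(x, sep, fixed = TRUE)
  vapply(parts,
         function(p) if (length(p) >= n) p[n] else NA_character_,
         character(1), USE.NAMES = FALSE)
}

string <- c("abcd.vdgd.dhdsg", "abcd.vdgd.dhdsg.jsgs", "nodots")
fields <- nth_field(string, 2)
fields
# [1] "vdgd" "vdgd" NA
```

Using fixed = TRUE avoids having to escape the dot, since the separator is then treated as a literal string rather than a regex.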

How to extract everything until first occurrence of pattern

I'm trying to use the stringr package in R to extract everything from a string up until the first occurrence of an underscore.
What I've tried
str_extract("L0_123_abc", ".+?(?<=_)")
> "L0_"
Close but no cigar. How do I get just "L0"? Also, ideally I'd like something that's easy to extend so that I can get the information between the 1st and 2nd underscore and the information after the 2nd underscore.
To get L0, you may use
> library(stringr)
> str_extract("L0_123_abc", "[^_]+")
[1] "L0"
The [^_]+ matches 1 or more chars other than _.
Also, you may split the string with _:
x <- str_split("L0_123_abc", fixed("_"))
> x
[[1]]
[1] "L0" "123" "abc"
This way, you will have all the substrings you need.
The same can be achieved with
> str_extract_all("L0_123_abc", "[^_]+")
[[1]]
[1] "L0" "123" "abc"
The regex lookaround should be
str_extract("L0_123_abc", ".+?(?=_)")
#[1] "L0"
Using gsub:
gsub("(.+?)(_.*)", "\\1", "L0_123_abc")
#[1] "L0"
You can use sub from base R with _.*, which removes everything from the first _ onwards.
sub("_.*", "", "L0_123_abc")
#[1] "L0"
Or use [^_], which matches any character except _.
sub("([^_]*).*", "\\1", "L0_123_abc")
#[1] "L0"
or using substr with regexpr.
substr("L0_123_abc", 1, regexpr("_", "L0_123_abc")-1)
#substr("L0_123_abc", 1, regexpr("_", "L0_123_abc", fixed=TRUE)-1) #More performant alternative
#[1] "L0"
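To extend this to the other pieces asked about in the question (between the 1st and 2nd underscore, and after the 2nd), a base R sketch could look like the following; the variable names are only illustrative.

```r
x <- "L0_123_abc"

first  <- sub("_.*", "", x)                          # before the 1st underscore
second <- sub("^[^_]*_([^_]*)_.*$", "\\1", x)        # between the 1st and 2nd
third  <- sub("^(?:[^_]*_){2}", "", x, perl = TRUE)  # after the 2nd

c(first, second, third)
# [1] "L0"  "123" "abc"
```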

Extract characters between specified characters in R

I have this variable
x= "379_exp_mirror1.csv"
I need to extract the number ("379") at the beginning (which doesn't always have 3 characters), i.e. everything before the first "_". And then I need to extract everything between the second "_" and the ".", in this case "mirror1".
I have tried several combinations with sub and gsub with no success, can anyone give me some indications please?
Thank you
You can use a regular expression. For your problem, ^(?<Number>[0-9]*)_.* does the job.
You can test your regular expression with this website: http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
Or you can split the string on underscores and then try to parse the pieces (int.TryParse). I think the second approach is better, but if you want to become a regular expression master, try the first method.
You can use sub to extract the substrings:
x <- "379_exp_mirror1.csv"
sub("_.*", "", x)
# [1] "379"
sub("^(?:.*_){2}(.*?)\\..*", "\\1", x)
# [1] "mirror1"
Another approach with gregexpr:
regmatches(x, gregexpr("^.*?(?=_)|(?<=_)[^_]*?(?=\\.)", x, perl = TRUE))[[1]]
# [1] "379" "mirror1"
Maybe you can try:
library(stringr)
x <- "379_exp_mirror1.csv"
str_extract_all(x, regex('^[0-9]+(?=_)|[[:alnum:]]+(?=\\.)'))[[1]]
#[1] "379" "mirror1"
Or
strsplit(x, "[._]")[[1]][c(T,F)]
#[1] "379" "mirror1"
Or
scan(text=gsub("[.]","_", x),what="",sep="_")[c(T,F)]
#Read 4 items
#[1] "379" "mirror1"
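Another sketch, assuming the file names always follow an id_condition_name.ext layout (an assumption based on the single example in the question): drop the extension with tools::file_path_sans_ext() and split what is left on underscores.

```r
x <- "379_exp_mirror1.csv"

stem  <- tools::file_path_sans_ext(x)            # "379_exp_mirror1"
parts <- strsplit(stem, "_", fixed = TRUE)[[1]]  # "379" "exp" "mirror1"

id   <- parts[1]   # leading number, whatever its length
name <- parts[3]   # piece after the second underscore
c(id, name)
# [1] "379"     "mirror1"
```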

Resources