How to extract string after 2nd delimiter in R

I have my vector as
dt <- c("1:7984985:A:G", "1:7984985-7984985:A:G", "1:7984985-7984985:T:G")
I would like to extract everything after 2nd :.
The result I would like is
A:G , A:G, T:G
What would be the solution for this?

We can use sub to match two instances of one or more characters that are not a : ([^:]+) followed by : from the start (^) of the string and replace it with blank ("")
sub("^([^:]+:){2}", "", dt)
#[1] "A:G" "A:G" "T:G"
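It can also be done with trimws, which strips leading and trailing characters matching the given class (so it depends on which characters surround the result, not on the position of the delimiter)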
trimws(dt, whitespace = "[-0-9:]")
#[1] "A:G" "A:G" "T:G"
Or using str_remove from stringr
library(stringr)
str_remove(dt, "^([^:]+:){2}")
#[1] "A:G" "A:G" "T:G"
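The same pattern can be built for any delimiter count. Below is a minimal base R sketch; the helper name after_nth_delim is hypothetical (it is not part of any package), and it assumes the delimiter is a single character that needs no escaping inside a bracket expression. It uses [^:]* rather than [^:]+ so that empty fields are also skipped.

```r
# Hypothetical helper: drop everything up to and including the n-th delimiter.
# Assumes `delim` is a single character that is safe inside [^...] (e.g. ":" or "_").
after_nth_delim <- function(x, n, delim = ":") {
  # For n = 2 and delim = ":" this builds "^([^:]*:){2}"
  pattern <- sprintf("^([^%s]*%s){%d}", delim, delim, n)
  sub(pattern, "", x)
}

dt <- c("1:7984985:A:G", "1:7984985-7984985:A:G", "1:7984985-7984985:T:G")
after_nth_delim(dt, 2)
#[1] "A:G" "A:G" "T:G"
```

Strings that contain fewer than n delimiters do not match the pattern and are returned unchanged.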

You can use sub, capture the items you want to retain in a capturing group (...) and refer back to them in the replacement argument to sub:
sub("^.:[^:]+:(.:.)", "\\1", dt, perl = T)
[1] "A:G" "A:G" "T:G"
Alternatively, you can use str_extract and positive lookbehind (?<=...):
library(stringr)
str_extract(dt, "(?<=:)[A-Z]:[A-Z]")
[1] "A:G" "A:G" "T:G"

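Or simply use str_split from stringr, which returns a list. With n = 3 the string is split into at most three pieces, so the third piece is everything after the 2nd :.
str_split("1:7984985:A:G", ":", n = 3)[[1]][3]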
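The str_split call above handles a single string. As a sketch of the same splitting idea applied to the whole vector in base R (strsplit has no n argument, so the pieces after the second one are pasted back together):

```r
dt <- c("1:7984985:A:G", "1:7984985-7984985:A:G", "1:7984985-7984985:T:G")

# Split on a literal ":", drop the first two pieces, re-join the rest with ":".
res <- vapply(
  strsplit(dt, ":", fixed = TRUE),
  function(p) paste(p[-(1:2)], collapse = ":"),
  character(1)
)
res
#[1] "A:G" "A:G" "T:G"
```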

Related

Convert sign in column names if not at certain position in R [duplicate]

I have a character string of names which look like
"_6302_I-PAL_SPSY_000237_001"
I need to remove the first occurred underscore, so that it will be as
"6302_I-PAL_SPSY_000237_001"
I am aware of gsub, but it removes all of the underscores. Thank you for any suggestions.
gsub can do this too: use the ^ symbol to anchor the pattern to the start of the string
x <- "_6302_I-PAL_SPSY_000237_001"
x <- gsub("^\\_","",x)
[1] "6302_I-PAL_SPSY_000237_001"
We can use sub with pattern as _ and replacement as blanks (""). This will remove the first occurrence of '_'.
sub("_", "", str1)
#[1] "6302_I-PAL_SPSY_000237_001"
NOTE: This removes the first occurrence of _ wherever it appears; it is not limited to the start of the string.
For example, suppose we have string
str2 <- "6302_I-PAL_SPSY_000237_001"
sub("_", "", str2)
#[1] "6302I-PAL_SPSY_000237_001"
As the example has _ at the beginning, another option is substring
substring(str1, 2)
#[1] "6302_I-PAL_SPSY_000237_001"
data
str1 <- "_6302_I-PAL_SPSY_000237_001"
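To make the difference between the unanchored and the anchored pattern concrete, here is a small sketch that runs both on a string with and without a leading underscore. The variable names with_lead and no_lead are only illustrative.

```r
with_lead <- "_6302_I-PAL_SPSY_000237_001"
no_lead   <- "6302_I-PAL_SPSY_000237_001"

# Unanchored: removes the first "_" wherever it occurs.
unanchored <- sub("_", "", c(with_lead, no_lead))
unanchored
#[1] "6302_I-PAL_SPSY_000237_001" "6302I-PAL_SPSY_000237_001"

# Anchored with ^: removes "_" only when it is the first character.
anchored <- sub("^_", "", c(with_lead, no_lead))
anchored
#[1] "6302_I-PAL_SPSY_000237_001" "6302_I-PAL_SPSY_000237_001"
```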
This can be done with base R's trimws() too
string1<-"_6302_I-PAL_SPSY_000237_001"
trimws(string1, which='left', whitespace = '_')
[1] "6302_I-PAL_SPSY_000237_001"
In case we have multiple words with leading underscores, we may have to include a word boundary (\\b) in our regex, and use either gsub or stringr::str_remove_all:
string2<-paste(string1, string1)
string2
[1] "_6302_I-PAL_SPSY_000237_001 _6302_I-PAL_SPSY_000237_001"
library(stringr)
str_remove_all(string2, "\\b_")
[1] "6302_I-PAL_SPSY_000237_001 6302_I-PAL_SPSY_000237_001"

capitalize the first letter of two words separated by underscore using stringr

I have a string like word_string. What I want is Word_String. If I use the function str_to_title from stringr, what I get is Word_string. It does not capitalize the second word.
Does anyone know any elegant way to achieve that with stringr? Thanks!
Here is a base R option using gsub:
input <- "word_string"
output <- gsub("(?<=^|_)([a-z])", "\\U\\1", input, perl=TRUE)
output
[1] "Word_String"
The regex pattern matches and captures any lowercase letter [a-z] that is preceded by either the start of the string (i.e. it's the first letter) or an underscore. Each captured letter is then replaced with its uppercase version. Note that the \U modifier, which converts the replacement to uppercase, is a Perl extension, so gsub must be called with perl=TRUE.
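As a sketch of an equivalent formulation without lookbehind: capture the separator (start of string or underscore) as group 1 and the letter as group 2, then write group 1 back unchanged and apply \U only to group 2. The input vector here is made up for illustration and also shows that the call is vectorized.

```r
input <- c("word_string", "one_two_three")

# \\1 is emitted as-is; \\U uppercases the rest of the replacement, i.e. \\2.
output <- gsub("(^|_)([a-z])", "\\1\\U\\2", input, perl = TRUE)
output
#[1] "Word_String"   "One_Two_Three"
```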
Can also use to_any_case from snakecase
library(snakecase)
to_any_case(str1, "title", sep_out = "_")
#[1] "Word_String"
data
str1 <- "word_string"
This is obviously overly complicated, but here is another base R possibility:
test <- "word_string"
paste0(unlist(lapply(strsplit(test, "_"),function(x)
paste0(toupper(substring(x,1,1)),
substring(x,2,nchar(x))))),collapse="_")
[1] "Word_String"
You could first use gsub to replace "_" by " " and apply the str_to_title function
Then use gsub again to change it back to your format
library(stringr)
x <- str_to_title(gsub("_"," ","word_string"))
gsub(" ","_",x)
[1] "Word_String"

String replace with regex condition

I have a pattern that I want to match and replace with an X. However, I only want the pattern to be replaced if the preceding character is either an A or a B, or if it is not preceded by any character (beginning of string).
I know how to replace patterns using the str_replace_all function but I don't know how I can add this additional condition. I use the following code:
library(stringr)
string <- "0000A0000B0000C0000D0000E0000A0000"
pattern <- c("0000")
replacement <- str_replace_all(string, pattern, paste0("XXXX"))
Result:
[1] "XXXXAXXXXBXXXXCXXXXDXXXXEXXXXAXXXX"
Desired result:
Replacement only when the preceding character is A, B or no character:
[1] "XXXXAXXXXBXXXXC0000D0000E0000AXXXX"
You may use
gsub("(^|[AB])0000", "\\1XXXX", string)
See the regex demo
Details
(^|[AB]) - Capturing group 1 (\1): start of string (^) or (|) A or B ([AB])
0000 - four zeros.
R demo:
string <- "0000A0000B0000C0000D0000E0000A0000"
pattern <- c("XXXX")
gsub("(^|[AB])0000", "\\1XXXX", string)
## -> [1] "XXXXAXXXXBXXXXC0000D0000E0000AXXXX"
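The same condition can also be written as a lookbehind, so that the preceding character is checked but not consumed and no backreference is needed in the replacement. This is only a sketch of that alternative; it requires perl = TRUE because the default regex engine does not support lookbehind.

```r
string <- "0000A0000B0000C0000D0000E0000A0000"

# (?<=^|[AB]) asserts that the match starts at the beginning of the string
# or directly after an A or a B, without including that character in the match.
result <- gsub("(?<=^|[AB])0000", "XXXX", string, perl = TRUE)
result
#[1] "XXXXAXXXXBXXXXC0000D0000E0000AXXXX"
```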
You could also try the following, which uses a positive lookahead.
string <- "0000A0000B0000C0000D0000E0000A0000"
gsub(x = string, pattern = "(^|A|B)(?=0000)((?i)0000?)",
replacement = "\\1xxxx", perl=TRUE)
Output will be as follows.
[1] "xxxxAxxxxBxxxxC0000D0000E0000Axxxx"
Thanks to Wiktor Stribiżew for the answer! It also works with the stringr package:
library(stringr)
string <- "0000A0000B0000C0000D0000E0000A0000"
pattern <- c("0000")
replace <- str_replace_all(string, paste0("(^|[AB])",pattern), "\\1XXXX")
replace
[1] "XXXXAXXXXBXXXXC0000D0000E0000AXXXX"

String between first two (.dots)

I have data which contains two or more dots. My requirement is to get the string between the first and the second dot.
E.g. string <- "abcd.vdgd.dhdsg"
Expected result: vdgd
I have used
pt <-strapply(string, "\\.(.*)\\.", simplify = TRUE)
which gives the correct result here, but for a string with more than two dots it does not work as expected.
E.g. string <- "abcd.vdgd.dhdsg.jsgs"
It gives vdgd.dhdsg, but the expected result is vdgd.
Could anyone help me?
In base R we can use strsplit
ss <- "abcd.vdgd.dhdsg"
unlist(strsplit(ss, "\\."))[2]
#[1] "vdgd"
Or using gregexpr with regmatches
unlist(regmatches(ss, gregexpr("[^\\.]+", ss)))[2]
#[1] "vdgd"
Or using gsub (thanks @TCZhang)
gsub("^.+?\\.(.+?)\\..*$", "\\1", ss)
#[1] "vdgd"
Another option:
string <- "abcd.vdgd.dhdsg.jsgs"
library(stringr)
str_extract(string = string, pattern = "(?<=\\.).*?(?=\\.)")
[1] "vdgd"
I like this one because the str_extract function will return the first instance of the correct pattern, but you could also use str_extract_all to get all instances.
str_extract_all(string = string, pattern = "(?<=\\.).*?(?=\\.)")
[[1]]
[1] "vdgd" "dhdsg"
From here, you could index to get any position between two dots you want.
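For readers who prefer to stay in base R, the same extract-then-index approach can be sketched with gregexpr and regmatches. The variable name between_dots is only illustrative.

```r
string <- "abcd.vdgd.dhdsg.jsgs"

# perl = TRUE is required for the lookbehind (?<=...) and lookahead (?=...).
between_dots <- regmatches(
  string,
  gregexpr("(?<=\\.).*?(?=\\.)", string, perl = TRUE)
)[[1]]
between_dots
#[1] "vdgd"  "dhdsg"

# Index to pick the segment you want, e.g. the second one.
between_dots[2]
#[1] "dhdsg"
```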
Another solution with the qdapRegex package:
library(qdapRegex)
ex_between("abcd.vdgd.dhdsg.jsgs", ".", ".")[[1]][1]
# "vdgd"
You can use read.table as well if you wish. Pass the string as the text argument and set the separator to a dot ("."). The result is a data.frame with one column per field, so you can select whichever column you want (in this case column number 2).
read.table(text=string, sep=".",stringsAsFactors = FALSE)[,2]
Output:
> read.table(text=string, sep=".",stringsAsFactors = FALSE)[,2]
[1] "vdgd"
Here is a fun easy way via stringr
stringr::word(string, 2, sep = '\\.')
Here are two options that are vectorized over the input string vector.
You can try tstrsplit from data.table:
library(data.table)
string <- c("abcd.vdgd.dhdsg", "abcd.vdgd.dhdsg.jsgs")
tstrsplit(string, '.', fixed = TRUE)[[2]]
[1] "vdgd" "vdgd"
or regex:
sub('.*?\\.(.*?)\\..*', '\\1', string)
[1] "vdgd" "vdgd"

stringr str_extract capture group capturing everything

I'm looking to extract the year from a string. This always comes after an 'X' and before "." then a string of other characters.
Using stringr's str_extract I'm trying the following:
year = str_extract(string = 'X2015.XML.Outgoing.pounds..millions.'
, pattern = 'X(\\d{4})\\.')
I thought the brackets would define the capture group, returning 2015, but I actually get the complete match X2015.
Am I doing this correctly? Why am I not trimming "X" and "."?
The capture group is irrelevant in this case. The function str_extract will return the whole match including characters before and after the capture group.
You have to work with lookbehind and lookahead instead. These are zero-width assertions: they check the surrounding characters without including them in the match.
library(stringr)
str_extract(string = 'X2015.XML.Outgoing.pounds..millions.',
pattern = '(?<=X)\\d{4}(?=\\.)')
# [1] "2015"
This regex matches four consecutive digits that are preceded by an X and followed by a dot (.).
I believe the most idiomatic way is to use str_match:
str_match(string = 'X2015.XML.Outgoing.pounds..millions.',
pattern = 'X(\\d{4})\\.')
Which returns the complete match followed by capture groups:
     [,1]     [,2]
[1,] "X2015." "2015"
As such the following will do the trick:
str_match(string = 'X2015.XML.Outgoing.pounds..millions.',
pattern = 'X(\\d{4})\\.')[2]
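Base R offers a comparable route through regexec and regmatches, which likewise return the full match followed by the capture groups. A minimal sketch (the variable name m is only illustrative, and [0-9] is used in place of \\d purely as a conservative choice):

```r
string <- "X2015.XML.Outgoing.pounds..millions."

# regexec records the position of the full match and of each capture group;
# regmatches turns those positions into the matched substrings.
m <- regmatches(string, regexec("X([0-9]{4})\\.", string))[[1]]
m
#[1] "X2015." "2015"

# The second element is the first capture group.
m[2]
#[1] "2015"
```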
Alternatively, you can use gsub:
string = 'X2015.XML.Outgoing.pounds..millions.'
gsub("X(\\d{4})\\..*", "\\1", string)
# [1] "2015"
or str_replace from stringr:
library(stringr)
str_replace(string, "X(\\d{4})\\..*", "\\1")
# [1] "2015"
