How to apply the function substr to each element of a string - r

I have this string of words
string<-c("chair-desk-tree-table-computer-mousse")
I want to retrieve the first three characters of each word and store them in an object like this:
newstring==> [1] "cha-des-tre-tab-com-mou"

> newstring <- substring( strsplit(string, "-")[[1]], 1, 3)
> newstring <- paste0(newstring, collapse = "-")
> newstring
[1] "cha-des-tre-tab-com-mou"
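The same split, truncate, and re-join steps can be wrapped so they work on a whole character vector rather than a single string. This is only a sketch: the helper name `abbreviate_words` and its `n` and `sep` arguments are made up for illustration and are not part of the original answer.

```r
# Hypothetical helper: keep the first n characters of every sep-delimited word
abbreviate_words <- function(x, n = 3, sep = "-") {
  # fixed = TRUE treats sep as a literal string, not a regex
  parts <- strsplit(x, sep, fixed = TRUE)
  # Truncate each word, then glue the pieces back together per input string
  vapply(parts,
         function(p) paste(substring(p, 1, n), collapse = sep),
         character(1))
}

string <- c("chair-desk-tree-table-computer-mousse", "K29-E665-I1190")
newstring <- abbreviate_words(string)
newstring
# [1] "cha-des-tre-tab-com-mou" "K29-E66-I11"
```

Using `vapply()` with `character(1)` guarantees one output string per input string.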

Using gsub with a regex lookbehind to match (and remove) the one or more lower case letters that follow the first 3 lower case letters of each word
gsub("(?<=\\b[a-z]{3})[a-z]+", "", string, perl = TRUE)
[1] "cha-des-tre-tab-com-mou"
Using the edited string
> string <- c(string, "K29-E665-I1190")
> gsub("(?<=\\b[[:alnum:]]{3})[[:alnum:]]+", "", string, perl = TRUE)
[1] "cha-des-tre-tab-com-mou" "K29-E66-I11"


Regex to add comma between any character

I'm relatively new to regex, so bear with me if the question is trivial. I'd like to place a comma between every letter of a string using regex, e.g.:
x <- "ABCD"
I want to get
"A,B,C,D"
It would be nice if I could do that using gsub, sub or related on a vector of strings of arbitrary number of characters.
I tried
> sub("(\\w)", "\\1,", x)
[1] "A,BCD"
> gsub("(\\w)", "\\1,", x)
[1] "A,B,C,D,"
> gsub("(\\w)(\\w{1})$", "\\1,\\2", x)
[1] "ABC,D"
Try:
x <- 'ABCD'
gsub('\\B', ',', x, perl = TRUE)
Prints:
[1] "A,B,C,D"
I might have misread the question; if the OP wants commas between letters only, try:
gsub('(\\p{L})(?=\\p{L})', '\\1,', x, perl = TRUE)
(\p{L}) - Match any kind of letter from any language in a 1st group;
(?=\p{L}) - Positive lookahead that requires another letter immediately to the right.
We can use the backreference to this capture group in the replacement.
You can use
> gsub("(.)(?=.)", "\\1,", x, perl=TRUE)
[1] "A,B,C,D"
The (.)(?=.) regex matches any char, capturing it into Group 1 (with (.)), that must be followed by any single char ((?=.) is a positive lookahead that requires a char immediately to the right of the current location).
Variations of the solution:
> gsub("(.)(?!$)", "\\1,", x, perl=TRUE)
## Or with stringr:
## stringr::str_replace_all(x, "(.)(?!$)", "\\1,")
[1] "A,B,C,D"
Here, (?!$) is a negative lookahead that fails the match if the end of the string is immediately to the right of the current location.
See the R demo online:
x <- "ABCD"
gsub("(.)(?=.)", "\\1,", x, perl=TRUE)
# => [1] "A,B,C,D"
gsub("(.)(?!$)", "\\1,", x, perl=TRUE)
# => [1] "A,B,C,D"
stringr::str_replace_all(x, "(.)(?!$)", "\\1,")
# => [1] "A,B,C,D"
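The letter-only pattern and the any-character pattern from the answers above give the same result for "ABCD" but diverge once the input contains digits or punctuation. The inputs "AB-12" and "a1b2" below are made-up examples, not from the original question.

```r
# Illustrative inputs; only "ABCD" comes from the question
x <- c("ABCD", "AB-12", "a1b2")

# Comma only between two adjacent letters
letters_only <- gsub("(\\p{L})(?=\\p{L})", "\\1,", x, perl = TRUE)
letters_only
# [1] "A,B,C,D" "A,B-12"  "a1b2"

# Comma after every character that is followed by another character
any_char <- gsub("(.)(?=.)", "\\1,", x, perl = TRUE)
any_char
# [1] "A,B,C,D"   "A,B,-,1,2" "a,1,b,2"
```

Which one is appropriate depends on whether non-letters should be separated too.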
A non-regex answer:
paste(strsplit(x, "")[[1]], collapse = ",")
#[1] "A,B,C,D"
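The question asks for something that works on a vector of strings, while the `[[1]]` above handles only the first element. A minimal sketch of a vectorised variant, assuming non-empty input strings (the vector below is made up):

```r
# Illustrative vector of strings of different lengths
x <- c("ABCD", "xy", "Z")

# strsplit() returns one character vector per input string;
# vapply() collapses each of them back into a single string
with_commas <- vapply(strsplit(x, ""), paste, character(1), collapse = ",")
with_commas
# [1] "A,B,C,D" "x,y"     "Z"
```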
Another option is to use a positive lookbehind and a positive lookahead to assert that there is both a preceding and a following character:
library(stringr)
str_replace_all(x, "(?<=.)(?=.)", ",")
[1] "A,B,C,D"

Extract a substring in R with no pattern

If one of my strings in a column looks like,
string = "P/project/dhi_intro_genomics/genomics/gene/pag-files-per-patient/000tg82e-99c4-4h20-9ude-d95e15005a 3c_KXgES5FtCpLhQce7mGkuMX/XML/JH_DN_S9_2000-12-27_MTW-29FEB1997UW"
Is there a str_extract code to get
sub_string = "000tg82e-99c4-4h20-9ude-d95e15005a 3c"
from the original 'string'?
We can use a pattern that extracts the run of characters that are not _ immediately after the patient/ substring
library(stringr)
str_extract(string, "(?<=patient\\/)0+[^_]+")
[1] "000tg82e-99c4-4h20-9ude-d95e15005a 3c"
If there is no pattern and you want to extract the 7th element based on the delimiter /
trimws(strsplit(string, "/")[[1]][7], whitespace = "_.*")
[1] "000tg82e-99c4-4h20-9ude-d95e15005a 3c"
Or with str_replace
str_replace(string, "([^/]+/){6}([^_]+)_.*", "\\2")
[1] "000tg82e-99c4-4h20-9ude-d95e15005a 3c"
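If stringr is not available, the same lookbehind idea can be sketched in base R with `regexpr()` and `regmatches()`. The pattern here is slightly simplified compared with the `str_extract` call above (it drops the `0+` prefix), which is an assumption that the ID does not need to start with a zero.

```r
string <- "P/project/dhi_intro_genomics/genomics/gene/pag-files-per-patient/000tg82e-99c4-4h20-9ude-d95e15005a 3c_KXgES5FtCpLhQce7mGkuMX/XML/JH_DN_S9_2000-12-27_MTW-29FEB1997UW"

# Find the first run of non-underscore characters right after "patient/"
m <- regexpr("(?<=patient/)[^_]+", string, perl = TRUE)

# regmatches() returns only the matched text (non-matching elements are dropped)
sub_string <- regmatches(string, m)
sub_string
# [1] "000tg82e-99c4-4h20-9ude-d95e15005a 3c"
```

Note that, unlike `str_extract`, `regmatches()` silently drops elements without a match instead of returning `NA`, so the output can be shorter than the input.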
For the new string
str_new <- c("/P/project/dlf_intro_aion/Y0793/Y0793_8665030498_T1_K1IJ2_ps20200918125614.htj.gz.pd5",
"/P/project/dlf_intro_aion/H051/H0518946_032983_T1_K1ID2_ps20289239171246.par.gz"
)
str_replace(str_new, "^/?([^/]+/){4}[^_]+_([^_]+)_.*", "\\2")
[1] "8665030498" "032983"

A regex to remove the pattern "[0-9]g"

I have the following sample dataset:
XYZ 185g
ABC 60G
Gha 20g
How do I remove the strings "185g", "60G", "20g" without accidentally removing the letters g and G in the main words?
I tried the code below, but it replaces the letters in the main words as well.
a <- str_replace_all(a$words,"[0-9]"," ")
a <- str_replace_all(a$words,"[gG]"," ")
You need to combine them into something like
a$words <- str_replace_all(a$words,"\\s*\\d+[gG]$", "")
The \s*\d+[gG]$ regex matches
\s* - zero or more whitespaces
\d+ - one or more digits
[gG] - g or G
$ - end of string.
If you can have these strings inside a string, not just at the end, you may use
a$words <- str_replace_all(a$words,"\\s*\\d+[gG]\\b", "")
where $ is replaced with a \b, a word boundary.
To ignore case,
a$words <- str_replace_all(a$words, regex("\\s*\\d+g\\b", ignore_case=TRUE), "")
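A small base R sketch of the word-boundary variant, run on the sample data plus one made-up entry ("Sugar 5g pack") where the weight sits in the middle of the string:

```r
# First three values are from the question; the last one is illustrative
words <- c("XYZ 185g", "ABC 60G", "Gha 20g", "Sugar 5g pack")

# Optional whitespace, digits, then g/G ending at a word boundary
cleaned <- gsub("\\s*\\d+g\\b", "", words, ignore.case = TRUE, perl = TRUE)
cleaned
# [1] "XYZ"        "ABC"        "Gha"        "Sugar pack"
```

The "G" in "Gha" and the "g" in "Sugar" survive because neither is preceded by a digit.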
You can try
> gsub("\\s\\d+g$", "", c("XYZ 185g", "ABC 60G", "Gha 20g"), ignore.case = TRUE)
[1] "XYZ" "ABC" "Gha"
You can also use the following solution:
vec <- c("XYZ 185g", "ABC 60G", "Gha 20g")
gsub("[A-Za-z]+(*SKIP)(*FAIL)|[ 0-9Gg]+", "", vec, perl = TRUE)
[1] "XYZ" "ABC" "Gha"

regex strsplit expression in R so it only applies once to the first occurrence of a specific character in each string?

I have a list filled with strings:
string<- c("SPG_L_subgenual_ACC_R", "SPG_R_MTG_L_pole", "MTG_L_pole_CerebellumGM_L")
I need to split the strings so they appear like:
"SPG_L", "subgenual_ACC_R", "SPG_R", "MTG_L_pole", "MTG_L_pole", "CerebellumGM_L"
I tried using the following regex expression to split the strings:
str_split(string,'(?<=[[RL]|pole])_')
But this leads to:
"SPG_L", "subgenual" "ACC_R", "SPG_R", "MTG_L", "pole", "MTG_L", "pole", "CerebellumGM_L"
How do I edit the regex expression so it splits each string element at the "_" after the first occurrence of "R", "L" unless the first occurrence of "R" or "L" is followed by "pole", then it splits the string element after the first occurrence of "pole" and only splits each string element once?
I suggest a matching approach using
^(.*?[RL](?:_pole)?)_(.*)
See the regex demo
Details
^ - start of string
(.*?[RL](?:_pole)?) - Group 1:
.*? - any zero or more chars other than line break chars as few as possible
[RL](?:_pole)? - R or L optionally followed with _pole
_ - an underscore
(.*) - Group 2: any zero or more chars other than line break chars as many as possible
See the R demo:
library(stringr)
x <- c("SPG_L_subgenual_ACC_R", "SPG_R_MTG_L_pole", "MTG_L_pole_CerebellumGM_L", "SFG_pole_R_IFG_triangularis_L", "SFG_pole_R_IFG_opercularis_L" )
res <- str_match_all(x, "^(.*?[RL](?:_pole)?)_(.*)")
lapply(res, function(x) x[-1])
Output:
[[1]]
[1] "SPG_L" "subgenual_ACC_R"
[[2]]
[1] "SPG_R" "MTG_L_pole"
[[3]]
[1] "MTG_L_pole" "CerebellumGM_L"
[[4]]
[1] "SFG_pole_R" "IFG_triangularis_L"
[[5]]
[1] "SFG_pole_R" "IFG_opercularis_L"
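The same matching approach can be sketched without stringr, using `regexec()` and `regmatches()` from base R. This assumes R 3.3.0 or later, where `regexec()` accepts `perl = TRUE` (needed for the lazy `.*?` and the non-capturing group).

```r
x <- c("SPG_L_subgenual_ACC_R", "SPG_R_MTG_L_pole", "MTG_L_pole_CerebellumGM_L")
pattern <- "^(.*?[RL](?:_pole)?)_(.*)"

# Each list element is c(full match, group 1, group 2)
m <- regmatches(x, regexec(pattern, x, perl = TRUE))

# Drop the full match, keeping only the two captured halves
parts <- lapply(m, function(hit) hit[-1])
parts[[1]]
# [1] "SPG_L"           "subgenual_ACC_R"
```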
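Another option is to split once at the "_" after "pole" and, for strings where that split finds nothing, once at the "_" after the first "R" or "L":
library(stringr)
# Fallback: split once after the first R or L if the "pole" split found nothing
split_again <- function(x) {
  if (length(x) > 1) {
    return(x)
  }
  # [RL] rather than [R|L]: inside a character class, | is a literal character
  str_split(string = x, pattern = "(?<=[RL])_", n = 2)
}
str_split(string = string, pattern = "(?<=pole)_", n = 2) %>%
  lapply(split_again) %>%
  unlist()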
You could use sub to replace the first matching "_" with a marker (here ":"), then strsplit on that marker:
strsplit(sub("^.*?[LR](?:_pole)?\\K_",":",string,perl=TRUE),":")
[[1]]
[1] "SPG_L" "subgenual_ACC_R"
[[2]]
[1] "SPG_R" "MTG_L_pole"
[[3]]
[1] "MTG_L_pole" "CerebellumGM_L"

Removing/replacing brackets from R string using gsub

I want to remove or replace brackets "(" or ")" from my string using gsub. However as shown below it is not working. What could be the reason?
> k<-"(abc)"
> t<-gsub("()","",k)
> t
[1] "(abc)"
Using the correct regex works:
gsub("[()]", "", "(abc)")
The additional square brackets mean "match any of the characters inside".
A safe and simple solution that doesn't rely on regex:
k <- gsub("(", "", k, fixed = TRUE) # fixed = TRUE disables regex
k <- gsub(")", "", k, fixed = TRUE)
k
[1] "abc"
Another possibility, along the lines of what the OP was trying, is to escape the brackets:
gsub("\\(|\\)","","(abc)")
#[1] "abc"
`\(` => matches a literal `(` character. The `\` is needed because `(` is a special character in a regex.
`|` => OR condition
`\)` => matches a literal `)` character
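As a quick check, the character-class approach and the `fixed = TRUE` approach from the answers above can be compared on a vector. The second input, "f(x) + g(y)", is a made-up example.

```r
# "(abc)" is from the question; the second string is illustrative
k <- c("(abc)", "f(x) + g(y)")

# Regex: a character class matching either bracket
via_regex <- gsub("[()]", "", k)

# No regex: remove each bracket as a literal string
via_fixed <- gsub(")", "", gsub("(", "", k, fixed = TRUE), fixed = TRUE)

via_regex
# [1] "abc"     "fx + gy"
identical(via_regex, via_fixed)
# [1] TRUE
```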
