Remove substring in string and keep some substring - r

PES+PWA+PWH
I have the above string in a data frame in R. I have to write a script such that if it finds PES, it keeps PES and removes the rest.
I only want PES in the output.

The grouping operator in R's regex would allow removal of non-"PES" characters:
gsub("(.*)(PES)(.*)", "\\2", c("PES+PWA+PWH", "something else") )
#[1] "PES" "something else"
The problem description wasn't very clear, since another respondent interpreted your request very differently.

text <- c("hello", "PES+PWA+PWH", "world")
text[grepl("PES", text)] <- "PES"
# "hello" "PES" "world"
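Since the question mentions a data frame, here is a minimal sketch of the same idea applied to a column. The data frame df and the column name code are made up for illustration; substitute your own names.

```r
# Hypothetical data frame; the column name `code` is an assumption.
df <- data.frame(code = c("PES+PWA+PWH", "PWA+PWH", "PES"),
                 stringsAsFactors = FALSE)

# Where the value contains "PES", keep only "PES"; otherwise leave it alone.
# fixed = TRUE treats "PES" as a literal string rather than a regex.
df$code <- ifelse(grepl("PES", df$code, fixed = TRUE), "PES", df$code)
df$code
# [1] "PES"     "PWA+PWH" "PES"
```

Rows without PES are left untouched here; whether they should be blanked or dropped instead depends on what the question really wants.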

From your question above I assumed you meant you wanted to have only those rows that contain PES?
You can use the grep function in R for that
column<-c("PES-PSA","PES","PWS","PWA","PES+PWA+PWH")
column[grep("PES",column)]
[1] "PES-PSA" "PES" "PES+PWA+PWH"
Grep takes the string to match as its first argument and the vector you want to match in as the second.
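To make the difference between the grep variants concrete, here is a small sketch on the same vector. It only uses base R; the variable names idx, vals and hits are mine.

```r
column <- c("PES-PSA", "PES", "PWS", "PWA", "PES+PWA+PWH")

idx  <- grep("PES", column)                # positions of matching elements
vals <- grep("PES", column, value = TRUE)  # the matching elements themselves
hits <- grepl("PES", column)               # one TRUE/FALSE per element

idx
# [1] 1 2 5
hits
# [1]  TRUE  TRUE FALSE FALSE  TRUE
```

The logical vector from grepl is usually the handiest form for filtering rows of a data frame.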

Related

How to create a regex expression to get a substring between 2 pipes

I have a dataset that I'm trying to work with where I need to get the text between two pipe delimiters. The length of the text is variable so I can't use length to get it. This is the string:
ENST00000000233.10|ENSG00000004059.11|OTTHUMG000
I want to get the text between the first and second pipes, that being ENSG00000004059.11. I've tried several different regex expressions, but I can't really figure out the correct syntax. What should the correct regex expression be?
Here is a regex.
x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"
sub("^[^\\|]*\\|([^\\|]+)\\|.*$", "\\1", x)
#> [1] "ENSG00000004059.11"
Created on 2022-05-03 by the reprex package (v2.0.1)
Explanation:
^ beginning of string;
[^\\|]* not the pipe character zero or more times;
\\| the pipe character needs to be escaped since it's a meta-character;
^[^\\|]*\\| the 3 above combined mean to match anything but the pipe character at the beginning of the string zero or more times until a pipe character is found;
([^\\|]+) group match anything but the pipe character at least once;
\\|.*$ the second pipe plus anything until the end of the string.
Then replace the 1st (and only) group with itself, "\\1", thus removing everything else.
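As a rough check of how that pattern behaves on a whole vector, here is a sketch with two extra made-up inputs (only the first string comes from the question). Inside a bracket expression the pipe does not need escaping, so [^|] is used as a shorter equivalent of [^\\|].

```r
# First element is from the question; the other two are invented test cases.
ids <- c("ENST00000000233.10|ENSG00000004059.11|OTTHUMG000",
         "a|b|c|d",
         "no pipes here")

res <- sub("^[^|]*\\|([^|]+)\\|.*$", "\\1", ids)
res
# [1] "ENSG00000004059.11" "b"                  "no pipes here"
```

Note that sub returns the input unchanged when the pattern does not match, so strings without two pipes pass through as they are.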
Another option is to get the second item after splitting the string on |.
x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"
strsplit(x, "\\|")[[1]][[2]]
# strsplit(x, "[|]")[[1]][[2]]
# [1] "ENSG00000004059.11"
Or with tidyverse:
library(tidyverse)
str_split(x, "\\|") %>% map_chr(`[`, 2)
# [1] "ENSG00000004059.11"
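If you have many strings and want to stay in base R, the splitting approach can be sketched like this. The extra inputs are invented, and fixed = TRUE is used so the pipe is split on literally without escaping.

```r
# First element is from the question; the other two are invented test cases.
ids <- c("ENST00000000233.10|ENSG00000004059.11|OTTHUMG000",
         "a|b|c|d",
         "no pipes here")

parts  <- strsplit(ids, "|", fixed = TRUE)
# Take the second piece of each split; NA when there is no second piece.
second <- vapply(parts, function(p) p[2], character(1))
second
# [1] "ENSG00000004059.11" "b"                  NA
```

Unlike the sub approach, a string without any pipe gives NA here, which may or may not be what you want.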
Maybe use the regex for look ahead and look behind to extract strings that are surrounded by two "|".
The regex literally means: match one or more characters, as few as possible (.+?), that come right after a "|" ((?<=\\|)) and right before the next "|" ((?=\\|)).
library(stringr)
x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"
str_extract(x, "(?<=\\|).+?(?=\\|)")
[1] "ENSG00000004059.11"
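The same lookaround idea should also work without stringr, assuming perl = TRUE is set so that lookbehind and lookahead are available. This sketch uses [^|]+ instead of .+? and adds an invented second string to show gregexpr returning every field that sits between two pipes.

```r
x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"

# First match only
first <- regmatches(x, regexpr("(?<=\\|)[^|]+(?=\\|)", x, perl = TRUE))
first
# [1] "ENSG00000004059.11"

# All matches in an invented string with more than two pipes
y <- "a|b|c|d"
all_inner <- regmatches(y, gregexpr("(?<=\\|)[^|]+(?=\\|)", y, perl = TRUE))[[1]]
all_inner
# [1] "b" "c"
```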
Try this: \|.*\| or in R \\|.*\\| since you need to escape the escape characters. (It's just an escaped pipe, followed by any character (.) repeated any number of times (*), followed by another escaped pipe.)
Then pass the extracted match to str_sub(..., 2, -2) to get rid of the pipes if you don't want them.
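One thing to watch with .* is that it is greedy. A small sketch on an invented string with three pipes shows the difference from the lazy form .*?, and strips the surrounding pipes with base substr (used here in place of str_sub so the example has no package dependency).

```r
y <- "a|b|c|d"   # invented input with more than two pipes

greedy <- regmatches(y, regexpr("\\|.*\\|", y))
lazy   <- regmatches(y, regexpr("\\|.*?\\|", y, perl = TRUE))

greedy
# [1] "|b|c|"
lazy
# [1] "|b|"

# Drop the first and last character (the pipes) from the lazy match
inner <- substr(lazy, 2, nchar(lazy) - 1)
inner
# [1] "b"
```

With exactly two pipes, as in the question, both forms give the same result.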

Partial Match word from sentence in R

I am trying to partially match a string using the %in% operator in R. When I run the code below I get FALSE:
'I just want to partial match string' %in% 'partial'
FALSE
The expected output is TRUE in the above case (because it matches partially).
Since you want to match part of a sentence, you should try %like% from data.table. Check below:
library(data.table)
'I just want to partial match string' %like% 'partial'
TRUE
The output is TRUE
`%in_str%` <- function(pattern,s){
grepl(pattern, s)
}
Usage:
> 'a' %in_str% 'abc'
[1] TRUE
You need to strsplit the string so each word in it is its own element in a vector:
"partial" %in% unlist(strsplit('I just want to partial match string'," "))
[1] TRUE
strsplit takes a string and breaks it into a vector of shorter strings. In this case, it breaks on the space (that's the " " at the end), so that you get a vector of individual words. Unfortunately, strsplit returns its results as a list, which is why I wrapped it in an unlist, so we get a single vector.
Then we do the %in%, which works in the opposite direction from the one you used: you're trying to find out if string partial is %in% the sentence, not the other way around.
Of course, this is an annoying way of doing it, so it's probably better to go with a grep-based solution if you want to stay within base-R, or Priyanka's data.table solution above -- both of which will also be better at stuff like matching multiple-word strings.
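For reference, a minimal sketch of the grep-based route in base R. Passing fixed = TRUE makes grepl treat the search term as a plain substring rather than a regular expression, which is probably what "partial match" means here; the variable names are mine.

```r
sentence <- "I just want to partial match string"

one_word  <- grepl("partial", sentence, fixed = TRUE)
two_words <- grepl("partial match", sentence, fixed = TRUE)
one_word
# [1] TRUE
two_words
# [1] TRUE

# Why fixed = TRUE matters: without it, "." is a regex wildcard.
as_regex   <- grepl("a.c", "abc")
as_literal <- grepl("a.c", "abc", fixed = TRUE)
as_regex
# [1] TRUE
as_literal
# [1] FALSE
```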

Prevent grep in R from treating "." as a letter

I have a character vector that contains text similar to the following:
text <- c("ABc.def.xYz", "ge", "lmo.qrstu")
I would like to remove everything before a .:
> "xYz" "ge" "qrstu"
However, the grep function seems to be treating . as a letter:
pattern <- "([A-Z]|[a-z])+$"
grep(pattern, text, value = T)
> "ABc.def.xYz" "ge" "lmo.qrstu"
The pattern works elsewhere, such as on regexpal.
How can I get grep to behave as expected?
grep is for finding the pattern. It returns the indices of the vector elements that match a pattern. If value=TRUE is specified, it returns the values. From the description, it seems that you want to remove a substring instead of returning a subset of the initial vector.
If you need to remove the substring, you can use sub
sub('.*\\.', '', text)
#[1] "xYz" "ge" "qrstu"
As the first argument, we match a pattern i.e. '.*\\.'. It matches zero or more characters (.*) followed by a dot (\\.). The \\ is needed to escape the . to treat it as that symbol instead of any character. Because .* is greedy, this will match up to the last . character in the string. We replace that matched pattern with a '' as the replacement argument and thereby remove the substring.
grep doesn't do any replacements. It searches for matches and returns the indices (or the value if you specify value=T) that give a match. The results you're getting are just saying that those meet your criteria at some point in the string. If you added something that doesn't meet the criteria anywhere into your text vector (for example: "9", "#$%23", ...) then it wouldn't return those when you called grep on it.
If you want it just to return the matched portion you should look at the regmatches function. However for your purposes it seems like sub or gsub should do what you want.
gsub(".*\\.", "", text)
I would suggest reading the help page for regular expressions, ?regex. The Wikipedia page is a decent read as well, but note that R's regexes are a little different from some others. https://en.wikipedia.org/wiki/Regular_expression
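Since regmatches is mentioned but not shown, here is a short sketch of it with the pattern from the question. The extra element "123" is invented to show that regmatches silently drops elements with no match.

```r
text    <- c("ABc.def.xYz", "ge", "lmo.qrstu")
pattern <- "([A-Z]|[a-z])+$"

# regexpr locates the first match in each element; regmatches extracts it.
matched <- regmatches(text, regexpr(pattern, text))
matched
# [1] "xYz"   "ge"    "qrstu"

# An element with no letters at the end is dropped from the result.
text2    <- c(text, "123")
matched2 <- regmatches(text2, regexpr(pattern, text2))
length(matched2)
# [1] 3
```

Because of that dropping behaviour, sub is often the safer choice when the result has to line up with the original vector.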
You may try str_extract function from stringr package.
str_extract(text, "[^.]*$")
This matches the run of non-dot characters at the end of each string.
Your pattern does work, the problem is that grep does something different than what you are thinking it does.
Let's first use your pattern with str_extract_all from the package stringr.
library(stringr)
str_extract_all(text, pattern ="([A-Z]|[a-z])+$")
[[1]]
[1] "xYz"
[[2]]
[1] "ge"
[[3]]
[1] "qrstu"
Notice that the results came as you expected!
The problem you are having is that grep will give you the complete element that matches your regular expression and not only the matching part of the element. For example, in the example below, grep will return the first element because it matches "a":
grep(pattern = "a", x = c("abcdef", "bcdf"), value = TRUE)
[1] "abcdef"

How do I insert a certain string in another string at a particular location in r?

I am new to R. It may be a very simple thing to do but I am not able to figure it out.
Say, I have a string as follows:
This is an example string.
Now I want to make it as follows:
This is an (example/sample) string.
I know the location at which the change is to be made. (12th character in the given string).
I have a lot of strings where I need to perform a similar operation.
I think I don't understand the problem but if I do you could use gsub here:
x <- "This is an example string."
gsub("example", "(example/sample)", x)
## [1] "This is an (example/sample) string."
Here's one solution with regular expressions:
# the string
s <- "This is an example string."
# the position of the target's first character
pos <- 12
# create a regular expression
reg <- paste0("^(.{", pos - 1, "})(.+?\\b)(.*)")
# [1] "^(.{11})(.+?\\b)(.*)"
# modify string
sub(reg, "\\1\\(\\2/sample\\)\\3", s)
# [1] "This is an (example/sample) string."
Here's another regex flavoured solution using a lookbehind:
s <- "This is an example string."
pos <- 12
replacement <- '(example/sample)'
sub(sprintf('(?<=^.{%s})\\S*\\b', pos-1), replacement, s, perl=TRUE)
## [1] "This is an (example/sample) string."
Lookbehind (?<=x) is useful because the regex within it is part of the pattern but doesn't become part of the match (so we don't have to capture those characters and put them back later). The pattern above says: "The beginning of the string, followed by 11 characters, preceding zero or more non-whitespace characters, followed by a word boundary." Only the non-whitespace characters are replaced, by replacement.
Update
An alternative is to use strsplit to create a vector of words, and then identify the position in the vector of the character of interest (e.g. the 12th character), subsequently replacing that element with your new word. This is a bit slower than the regex approach, but makes it straightforward to request multiple replacements (at multiple character positions). For example:
f <- function(string, pos, new) {
s <- strsplit(string, '\\s')[[1]]
i <- findInterval(pos, c(gregexpr('(?<=\\b)\\w', string, perl=TRUE)[[1]],
nchar(string)))
s[i] <- mapply(sub, s[i], patt='\\b[[:alnum:]-]+\\b', repl=new, perl=TRUE)
paste0(s, collapse=' ')
}
f('This is an example string.', c(12, 20), c('excellent', 'function'))
## [1] "This is an excellent function."
Note that hyphenated words are fully replaced (i.e. not just the part up to a hyphen) by the replacement, and all other punctuation (outside the boundaries of hyphenated words) is retained.
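For comparison, here is a sketch that does the position-based edit with substr and paste0 instead of one large regex. The helper name wrap_word_at is made up, and it assumes pos points at the first character of a word made of letters, digits or underscores.

```r
# Hypothetical helper: wrap the word starting at `pos` as "(word/alt)".
wrap_word_at <- function(s, pos, alt) {
  rest <- substr(s, pos, nchar(s))
  # The word is the leading run of word characters in the remainder.
  word <- sub("^(\\w+).*$", "\\1", rest)
  paste0(substr(s, 1, pos - 1),
         "(", word, "/", alt, ")",
         substr(s, pos + nchar(word), nchar(s)))
}

out1 <- wrap_word_at("This is an example string.", 12, "sample")
out1
# [1] "This is an (example/sample) string."

# Second input is invented, to show a different position.
out2 <- wrap_word_at("Fix the broken link", 9, "dead")
out2
# [1] "Fix the (broken/dead) link"
```

All the functions used are vectorized, so passing vectors of strings and positions of the same length should work too.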

Removing a character from within a vector element

I have a vector of strings:
str.vect<-c ("abcR.1", "abcL.1", "abcR.2", "abcL.2")
str.vect
[1] "abcR.1" "abcL.1" "abcR.2" "abcL.2"
How can I remove the third character from the right in each vector element?
Here is the desired result:
"abc.1" "abc.1" "abc.2" "abc.2"
Thank you very much in advance
You can use nchar to find the length of each element of the vector
> nchar(str.vect)
[1] 6 6 6 6
Then you combine this with strtrim to get the beginning of each string
> strtrim(str.vect, nchar(str.vect)-3)
[1] "abc" "abc" "abc" "abc"
To get the end of the word you can then use substr (actually, you could use substr to get the beginning too...)
> substr(str.vect, nchar(str.vect)-1, nchar(str.vect))
[1] ".1" ".1" ".2" ".2"
And finally you use paste0 (which is paste with sep="") to stick them together
> paste0(strtrim(str.vect, nchar(str.vect)-3), # Beginning
substr(str.vect, nchar(str.vect)-1, nchar(str.vect))) # End
[1] "abc.1" "abc.1" "abc.2" "abc.2"
There are easier ways if you know your strings have some special characteristics
For instance, if the length is always 6 you can directly substitute the nchar calls with the appropriate value.
EDIT: alternatively, R also supports regular expressions, which make this task much easier.
> gsub(".(..)$", "\\1", str.vect)
[1] "abc.1" "abc.1" "abc.2" "abc.2"
The syntax is a bit more obscure, but not that difficult once you know what you are looking at.
The first parameter (".(..)$") is what you want to match
. matches any character, $ denotes the end of the string.
So ...$ indicates the last 3 characters in the string.
We put the last two in parentheses, so that we can store them in memory.
The second parameter tells us what you want to substitute the matched substring with. In our case we put \\1 which means "whatever was in the first pair of parentheses".
So essentially this command means: "find the last three characters in the string and change them with the last two".
The solution provided by nico seems fine, but a simpler alternative might be to use sub:
sub('.(.{2})$', '\\1', str.vect)
This searches for the pattern of: "any character (represented by .) followed by 2 of any character (represented by .{2}), followed by the end of the string (represented by $)". By wrapping the .{2} in parentheses, R captures whatever those last two characters were. The second argument is the string to replace the matched substrings with. In this case, we refer to the first string captured in the matched pattern. This is represented by \\1. (If you captured multiple parts of the pattern, with multiple sets of parentheses, you would refer to subsequent captured regions with, e.g. \\2, \\3, etc.)
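If the position from the right varies, the count inside the braces can be built with sprintf. This is only a sketch; the helper name drop_from_right is made up, and n is assumed to be a whole number of at least 2.

```r
# Hypothetical helper: remove the n-th character counted from the right.
drop_from_right <- function(x, n) {
  # For n = 3 this builds the pattern ".(.{2})$" used above.
  pattern <- sprintf(".(.{%d})$", n - 1)
  sub(pattern, "\\1", x)
}

str.vect <- c("abcR.1", "abcL.1", "abcR.2", "abcL.2")
third  <- drop_from_right(str.vect, 3)
second <- drop_from_right(str.vect, 2)
third
# [1] "abc.1" "abc.1" "abc.2" "abc.2"
second
# [1] "abcR1" "abcL1" "abcR2" "abcL2"
```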
str.vect<-c ("abcR.1", "abcL.1", "abcR.2", "abcL.2")
a <- strsplit(str.vect,split="")
b <- unlist(lapply(a,FUN=function(x) {x[4] <- ""
paste(x,collapse="")}
))
If you want to parameterize it further change 4 to a variable and put the index of the character you want to remove there.
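A parameterized version could look roughly like this. Instead of splitting into single characters, this sketch glues together the text before and after index i with substr; the helper name remove_char_at is made up.

```r
# Hypothetical helper: remove the character at position i (counted from the left).
remove_char_at <- function(x, i) {
  paste0(substr(x, 1, i - 1), substr(x, i + 1, nchar(x)))
}

str.vect <- c("abcR.1", "abcL.1", "abcR.2", "abcL.2")

from_left <- remove_char_at(str.vect, 4)
# Third from the right is position nchar(x) - 2, whatever the string length.
from_right <- remove_char_at(str.vect, nchar(str.vect) - 2)
from_left
# [1] "abc.1" "abc.1" "abc.2" "abc.2"
```

Using nchar for the index means the strings do not all need to be the same length.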
Not sure how general or efficient this is, but it seems to work with your example string:
(This seems very similar to nico's answer although I am not using the strtrim function.)
my.string <- c("abcR.1", "abcL.1", "abcR.2", "abcL.2")
n.char <- nchar(my.string)
the.beginning <- substr(my.string, n.char-(n.char-1), n.char-3)
the.end <- substr(my.string, n.char-1, n.char)
new.string <- paste0(the.beginning, the.end)
new.string
# [1] "abc.1" "abc.1" "abc.2" "abc.2"
The 3rd character from the right of each element is removed.
sapply(str.vect, function(x) gsub(substr(x, nchar(x)-2,nchar(x)-2), "", x))
This is a very quick and dirty answer, but that's what is needed sometimes:
#Define vector
str.vect <- c("abcR.1", "abcL.1", "abcR.2", "abcL.2")
#Use gsub to remove both 'R' and 'L' independently.
str.vect2 <- gsub("R", '', str.vect )
str.vect_final <- gsub("L", '', str.vect2 )
>str.vect_final
[1] "abc.1" "abc.1" "abc.2" "abc.2"
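The two gsub calls can probably be folded into one with a character class. The sketch below also shows a stricter variant that only removes an R or L sitting directly before the final ".digits" part; the input "RabcL.1" is invented to show where the two differ.

```r
str.vect <- c("abcR.1", "abcL.1", "abcR.2", "abcL.2")

# One call: remove every "R" or "L" anywhere in the string.
loose <- gsub("[RL]", "", str.vect)
loose
# [1] "abc.1" "abc.1" "abc.2" "abc.2"

# Stricter: only an R/L immediately before ".<digits>" at the end.
# The lookahead needs perl = TRUE.
strict_pat <- "[RL](?=\\.[0-9]+$)"
strict <- sub(strict_pat, "", str.vect, perl = TRUE)

# Invented input where the two approaches disagree
loose_odd  <- gsub("[RL]", "", "RabcL.1")
strict_odd <- sub(strict_pat, "", "RabcL.1", perl = TRUE)
loose_odd
# [1] "abc.1"
strict_odd
# [1] "Rabc.1"
```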
