Removing a character from within a vector element - r

I have a vector of strings:
str.vect<-c ("abcR.1", "abcL.1", "abcR.2", "abcL.2")
str.vect
[1] "abcR.1" "abcL.1" "abcR.2" "abcL.2"
How can I remove the third character from the right in each vector element?
Here is the desired result:
"abc.1" "abc.1" "abc.2" "abc.2"
Thank you very much in advance

You can use nchar to find the length of each element of the vector
> nchar(str.vect)
[1] 6 6 6 6
Then you combine this with strtrim to get the beginning of each string
> strtrim(str.vect, nchar(str.vect)-3)
[1] "abc" "abc" "abc" "abc"
To get the end of the word you can then use substr (actually, you could use substr to get the beginning too...)
> substr(str.vect, nchar(str.vect)-1, nchar(str.vect))
[1] ".1" ".1" ".2" ".2"
And finally you use paste0 (which is paste with sep="") to stick them together
> paste0(strtrim(str.vect, nchar(str.vect)-3), # Beginning
substr(str.vect, nchar(str.vect)-1, nchar(str.vect))) # End
[1] "abc.1" "abc.1" "abc.2" "abc.2"
There are easier ways if you know your strings have some special characteristics
For instance, if the length is always 6 you can directly substitute the nchar calls with the appropriate value.
EDIT: alternatively, R also supports regular expressions, which make this task much easier.
> gsub(".(..)$", "\\1", str.vect)
[1] "abc.1" "abc.1" "abc.2" "abc.2"
The syntax is a bit more obscure, but not that difficult once you know what you are looking at.
The first parameter (".(..)$") is what you want to match
. matches any character, $ denotes the end of the string.
So ...$ indicates the last 3 characters in the string.
We put the last two in parenthesis, so that we can store them in memory.
The second parameter tells us what you want to substitute the matched substring with. In our case we put \\1 which means "whatever was in the first pair of parenthesis".
So essentially this command means: "find the last three characters in the string and change them with the last two".

The solution provided by #nico seems fine, but a simpler alternative might be to use sub:
sub('.(.{2})$', '\\1', str.vect)
This searches for the pattern of: "any character (represented by .) followed by 2 of any character (represented by .{2}), followed by the end of the string (represented by $)". By wrapping the .{2} in parentheses, R captures whatever those last two characters were. The second argument is the string to replace the matched substrings with. In this case, we refer to the first string captured in the matched pattern. This is represented by \\1. (If you captured multiple parts of the pattern, with multiple sets of parentheses, you would refer to subsequent captured regions with, e.g. \\2, \\3, etc.)

str.vect<-c ("abcR.1", "abcL.1", "abcR.2", "abcL.2")
a <- strsplit(str.vect,split="")
a <- strsplit(str.vect,split="")
b <- unlist(lapply(a,FUN=function(x) {x[4] <- ""
paste(x,collapse="")}
))
If you want to parameterize it further change 4 to a variable and put the index of the character you want to remove there.

Not sure how general or efficient this is, but it seems to work with your example string:
(This seems very similar to nico's answer although I am not using the strtrim function.)
my.string <- c("abcR.1", "abcL.1", "abcR.2", "abcL.2")
n.char <- nchar(my.string)
the.beginning <- substr(my.string, n.char-(n.char-1), n.char-3)
the.end <- substr(my.string, n.char-1, n.char)
new.string <- paste0(the.beginning, the.end)
new.string
# [1] "abc.1" "abc.1" "abc.2" "abc.2"

The 3rd character from the right of each element is removed.
sapply(str.vec, function(x) gsub(substr(x, nchar(x)-2,nchar(x)-2), "", x))

This is a very quick and dirty answer, but thats what is needed sometimes:
#Define vector
str.vect <- c("abcR.1", "abcL.1", "abcR.2", "abcL.2")
#Use gsub to remove both 'R' and 'L' independently.
str.vect2 <- gsub("R", '', str.vect )
str.vect_final <- gsub("L", '', str.vect2 )
>str.vect_final
[1] "abc.1" "abc.1" "abc.2" "abc.2"

Related

How to create a regex expression to get a substring between 2 pipes

I have a dataset that I'm trying to work with where I need to get the text between two pipe delimiters. The length of the text is variable so I can't use length to get it. This is the string:
ENST00000000233.10|ENSG00000004059.11|OTTHUMG000
I want to get the text between the first and second pipes, that being ENSG00000004059.11. I've tried several different regex expressions, but I can't really figure out the correct syntax. What should the correct regex expression be?
Here is a regex.
x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"
sub("^[^\\|]*\\|([^\\|]+)\\|.*$", "\\1", x)
#> [1] "ENSG00000004059.11"
Created on 2022-05-03 by the reprex package (v2.0.1)
Explanation:
^ beginning of string;
[^\\|]* not the pipe character zero or more times;
\\| the pipe character needs to be escaped since it's a meta-character;
^[^\\|]*\\| the 3 above combined mean to match anything but the pipe character at the beginning of the string zero or more times until a pipe character is found;
([^\\|]+) group match anything but the pipe character at least once;
\\|.*$ the second pipe plus anything until the end of the string.
Then replace the 1st (and only) group with itself, "\\1", thus removing everything else.
Another option is to get the second item after splitting the string on |.
x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"
strsplit(x, "\\|")[[1]][[2]]
# strsplit(x, "[|]")[[1]][[2]]
# [1] "ENSG00000004059.11"
Or with tidyverse:
library(tidyverse)
str_split(x, "\\|") %>% map_chr(`[`, 2)
# [1] "ENSG00000004059.11"
Maybe use the regex for look ahead and look behind to extract strings that are surrounded by two "|".
The regex literally means - look one or more characters (.+?) behind "|" ((?<=\\|)) until one character before "|" ((?=\\|)).
library(stringr)
x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"
str_extract(x, "(?<=\\|).+?(?=\\|)")
[1] "ENSG00000004059.11"
Try this: \|.*\| or in R \\|.*\\| since you need to escape the escape characters. (It's just escaping the first pipe followed by any character (.) repeated any number of times (*) and followed by another escaped pipe).
Then wrap in str_sub(MyString, 2, -2) to get rid of the pipes if you don't want them.

Extract all numbers from a character string into a SINGLE character string of numbers in the original order [duplicate]

How can I extract digits from a string that can have a structure of xxxx.x or xxxx.x-x and combine them as a number? e.g.
list <- c("1010.1-1", "1010.2-1", "1010.3-1", "1030-1", "1040-1",
"1060.1-1", "1060.2-1", "1070-1", "1100.1-1", "1100.2-1")
The desired (numeric) output would be:
101011, 101021, 101031...
I tried
regexp <- "([[:digit:]]+)"
solution <- str_extract(list, regexp)
However that only extracts the first set of digits; and using something like
regexp <- "([[:digit:]]+\\.[[:digit:]]+\\-[[:digit:]]+)"
returns the first result (data in its initial form) if matched otherwise NA for shorter strings. Thoughts?
Remove all non-digit symbols:
list <- c("1010.1-1", "1010.2-1", "1010.3-1", "1030-1", "1040-1", "1060.1-1", "1060.2-1", "1070-1", "1100.1-1", "1100.2-1")
as.numeric(gsub("\\D+", "", list))
## => [1] 101011 101021 101031 10301 10401 106011 106021 10701 110011 110021
See the R demo online
I have no experience with R but I do know regular expressions. When I look at the pattern you're specifying "([[:digit:]]+)". I assume [[:digit:]] stands for [0-9], so you're capturing one group of digits.
It seems to me you're missing a + to make it capture multiple groups of digits.
I'd think you'd need to use "([[:digit:]]+)+".

Regex to remove all non-digit symbols from string in R

How can I extract digits from a string that can have a structure of xxxx.x or xxxx.x-x and combine them as a number? e.g.
list <- c("1010.1-1", "1010.2-1", "1010.3-1", "1030-1", "1040-1",
"1060.1-1", "1060.2-1", "1070-1", "1100.1-1", "1100.2-1")
The desired (numeric) output would be:
101011, 101021, 101031...
I tried
regexp <- "([[:digit:]]+)"
solution <- str_extract(list, regexp)
However that only extracts the first set of digits; and using something like
regexp <- "([[:digit:]]+\\.[[:digit:]]+\\-[[:digit:]]+)"
returns the first result (data in its initial form) if matched otherwise NA for shorter strings. Thoughts?
Remove all non-digit symbols:
list <- c("1010.1-1", "1010.2-1", "1010.3-1", "1030-1", "1040-1", "1060.1-1", "1060.2-1", "1070-1", "1100.1-1", "1100.2-1")
as.numeric(gsub("\\D+", "", list))
## => [1] 101011 101021 101031 10301 10401 106011 106021 10701 110011 110021
See the R demo online
I have no experience with R but I do know regular expressions. When I look at the pattern you're specifying "([[:digit:]]+)". I assume [[:digit:]] stands for [0-9], so you're capturing one group of digits.
It seems to me you're missing a + to make it capture multiple groups of digits.
I'd think you'd need to use "([[:digit:]]+)+".

Prevent grep in R from treating "." as a letter

I have a character vector that contains text similar to the following:
text <- c("ABc.def.xYz", "ge", "lmo.qrstu")
I would like to remove everything before a .:
> "xYz" "ge" "qrstu"
However, the grep function seems to be treating . as a letter:
pattern <- "([A-Z]|[a-z])+$"
grep(pattern, text, value = T)
> "ABc.def.xYz" "ge" "lmo.qrstu"
The pattern works elsewhere, such as on regexpal.
How can I get grep to behave as expected?
grep is for finding the pattern. It returns the index of the vector that matches a pattern. If, value=TRUE is specified, it returns the value. From the description, it seems that you want to remove substring instead of returning a subset of the initial vector.
If you need to remove the substring, you can use sub
sub('.*\\.', '', text)
#[1] "xYz" "ge" "qrstu"
As the first argument, we match a pattern i.e. '.*\\.'. It matches one of more characters (.*) followed by a dot (\\.). The \\ is needed to escape the . to treat it as that symbol instead of any character. This will match until the last . character in the string. We replace that matched pattern with a '' as the replacement argument and thereby remove the substring.
grep doesn't do any replacements. It searches for matches and returns the indices (or the value if you specify value=T) that give a match. The results you're getting are just saying that those meet your criteria at some point in the string. If you added something that doesn't meet the criteria anywhere into your text vector (for example: "9", "#$%23", ...) then it wouldn't return those when you called grep on it.
If you want it just to return the matched portion you should look at the regmatches function. However for your purposes it seems like sub or gsub should do what you want.
gsub(".*\\.", "", text)
I would suggest reading the help page for regexs ?regex. The wikipedia page is a decent read as well but note that R's regexs are a little different than some others. https://en.wikipedia.org/wiki/Regular_expression
You may try str_extract function from stringr package.
str_extract(text, "[^.]*$")
This would match all the non-dot characters exists at the last.
Your pattern does work, the problem is that grep does something different than what you are thinking it does.
Let's first use your pattern with str_extract_all from the package stringr.
library(stringr)
str_extract_all(text, pattern ="([A-Z]|[a-z])+$")
[[1]]
[1] "xYz"
[[2]]
[1] "ge"
[[3]]
[1] "qrstu"
Notice that the results came as you expected!
The problem you are having is that grep will give you the complete element that matches you regular expression and not only the matching part of the element. For example, in the example below, grep will return you the first element because it matches "a":
grep(pattern = "a", x = c("abcdef", "bcdf"), value = TRUE)
[1] "abcdef"

How to remove last n characters from every element in the R vector

I am very new to R, and I could not find a simple example online of how to remove the last n characters from every element of a vector (array?)
I come from a Java background, so what I would like to do is to iterate over every element of a$data and remove the last 3 characters from every element.
How would you go about it?
Here is an example of what I would do. I hope it's what you're looking for.
char_array = c("foo_bar","bar_foo","apple","beer")
a = data.frame("data"=char_array,"data2"=1:4)
a$data = substr(a$data,1,nchar(a$data)-3)
a should now contain:
data data2
1 foo_ 1
2 bar_ 2
3 ap 3
4 b 4
Here's a way with gsub:
cs <- c("foo_bar","bar_foo","apple","beer")
gsub('.{3}$', '', cs)
# [1] "foo_" "bar_" "ap" "b"
Although this is mostly the same with the answer by #nfmcclure, I prefer using stringr package as it provdies a set of functions whose names are most consistent and descriptive than those in base R (in fact I always google for "how to get the number of characters in R" as I can't remember the name nchar()).
library(stringr)
str_sub(iris$Species, end=-4)
#or
str_sub(iris$Species, 1, str_length(iris$Species)-3)
This removes the last 3 characters from each value at Species column.
The same may be achieved with the stringi package:
library('stringi')
char_array <- c("foo_bar","bar_foo","apple","beer")
a <- data.frame("data"=char_array, "data2"=1:4)
(a$data <- stri_sub(a$data, 1, -4)) # from the first to the (last-4)-th character
## [1] "foo_" "bar_" "ap" "b"
Similar to #Matthew_Plourde using gsub
However, using a pattern that will trim to zero characters i.e. return "" if the original string is shorter than the number of characters to cut:
cs <- c("foo_bar","bar_foo","apple","beer","so","a")
gsub('.{0,3}$', '', cs)
# [1] "foo_" "bar_" "ap" "b" "" ""
Difference is, {0,3} quantifier indicates 0 to 3 matches, whereas {3} requires exactly 3 matches otherwise no match is found in which case gsub returns the original, unmodified string.
N.B. using {,3} would be equivalent to {0,3}, I simply prefer the latter notation.
See here for more information on regex quantifiers:
https://www.regular-expressions.info/refrepeat.html
friendly hint when working with n characters of a string to cut off/replace:
--> be aware of whitespaces in your strings!
use base::gsub(' ', '', x, fixed = TRUE) to get rid of unwanted whitespaces in your strings. i spent quite some time to find out why the great solutions provided above did not work for me. thought it might be useful for others as well ;)

Resources