Removing/replacing brackets from R string using gsub - r

I want to remove or replace brackets "(" or ")" from my string using gsub. However as shown below it is not working. What could be the reason?
> k<-"(abc)"
> t<-gsub("()","",k)
> t
[1] "(abc)"

Using the correct regex works:
gsub("[()]", "", "(abc)")
The additional square brackets mean "match any of the characters inside".
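To see why the original call has no visible effect, here is a minimal sketch (the variable names are only for illustration). The pattern `()` is an empty group, so it matches the empty string rather than the literal characters, while `[()]` is a character class containing both parentheses.

```r
k <- "(abc)"

# "()" is an empty group: it matches the empty string, so nothing is removed
unchanged <- gsub("()", "", k)

# "[()]" is a character class that matches a literal "(" or ")"
removed <- gsub("[()]", "", k)

# the same pattern can replace the brackets instead of deleting them
replaced <- gsub("[()]", "_", k)

print(unchanged)  # [1] "(abc)"
print(removed)    # [1] "abc"
print(replaced)   # [1] "_abc_"
```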

A safe and simple solution that doesn't rely on regex:
k <- gsub("(", "", k, fixed = TRUE) # fixed = TRUE disables regex matching
k <- gsub(")", "", k, fixed = TRUE)
k
[1] "abc"

One possible way, along the lines of what the OP is trying:
gsub("\\(|)","","(abc)")
#[1] "abc"
`\(` => look for the `(` character. The `\` is needed because `(` is a special character (in an R string the backslash itself is doubled, hence `\\(`).
`|` => OR condition
`)` => look for `)`
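As a hedged variation on the alternation above, the sketch below escapes both parentheses explicitly and, for the "replace" half of the question, swaps them for square brackets with base R's chartr(). The input string and variable names are made up for illustration.

```r
x <- "f(a) + g(b)"

# escape both parentheses and join them with | (OR)
no_parens <- gsub("\\(|\\)", "", x)

# chartr() translates characters one-for-one: "(" -> "[" and ")" -> "]"
square <- chartr("()", "[]", x)

print(no_parens)  # [1] "fa + gb"
print(square)     # [1] "f[a] + g[b]"
```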

Related

Extract last digit [duplicate]

How can I get the last n characters from a string in R?
Is there a function like SQL's RIGHT?
I'm not aware of anything in base R, but it's straight-forward to make a function to do this using substr and nchar:
x <- "some text in a string"
substrRight <- function(x, n){
substr(x, nchar(x)-n+1, nchar(x))
}
substrRight(x, 6)
[1] "string"
substrRight(x, 8)
[1] "a string"
This is vectorised, as @mdsumner points out. Consider:
x <- c("some text in a string", "I really need to learn how to count")
substrRight(x, 6)
[1] "string" " count"
If you don't mind using the stringr package, str_sub is handy because you can use negatives to count backward:
x <- "some text in a string"
str_sub(x,-6,-1)
[1] "string"
Or, as Max points out in a comment to this answer,
str_sub(x, start= -6)
[1] "string"
Use stri_sub function from stringi package.
To get substring from the end, use negative numbers.
Look below for the examples:
stri_sub("abcde",1,3)
[1] "abc"
stri_sub("abcde",1,1)
[1] "a"
stri_sub("abcde",-3,-1)
[1] "cde"
You can install this package from github: https://github.com/Rexamine/stringi
It is available on CRAN now, simply type
install.packages("stringi")
to install this package.
str = 'This is an example'
n = 7
result = substr(str,(nchar(str)+1)-n,nchar(str))
print(result)
[1] "example"
Another reasonably straightforward way is to use regular expressions and sub:
sub('.*(?=.$)', '', string, perl=T)
So, "get rid of everything followed by one character". To grab more characters off the end, add however many dots in the lookahead assertion:
sub('.*(?=.{2}$)', '', string, perl=T)
where .{2} means .., or "any two characters", so meaning "get rid of everything followed by two characters".
sub('.*(?=.{3}$)', '', string, perl=T)
for three characters, etc. You can set the number of characters to grab with a variable, but you'll have to paste the variable value into the regular expression string:
n = 3
sub(paste('.+(?=.{', n, '})', sep=''), '', string, perl=T)
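The idea of pasting n into the pattern can be wrapped in a small helper. This is only a sketch: the name last_n_regex is hypothetical, it builds the pattern with sprintf() instead of paste(), and it keeps the `$` anchor used in the earlier patterns.

```r
# hypothetical helper: keep the last n characters using a lookahead
last_n_regex <- function(x, n) {
  # e.g. n = 3 gives ".*(?=.{3}$)": everything that is followed by
  # exactly three characters and then the end of the string
  pattern <- sprintf(".*(?=.{%d}$)", n)
  sub(pattern, "", x, perl = TRUE)
}

one <- last_n_regex("some text in a string", 6)
many <- last_n_regex(c("12345", "ABCDE"), 2)

print(one)   # [1] "string"
print(many)  # [1] "45" "DE"
```

Strings shorter than n do not match the pattern and are returned unchanged.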
UPDATE: as noted by mdsumner, the original code is already vectorised because substr is. Should have been more careful.
And if you want a vectorised version (based on Andrie's code)
substrRight <- function(x, n){
sapply(x, function(xx)
substr(xx, (nchar(xx)-n+1), nchar(xx))
)
}
> substrRight(c("12345","ABCDE"),2)
12345 ABCDE
"45" "DE"
Note that I have changed (nchar(x)-n) to (nchar(x)-n+1) to get n characters.
A simple base R solution using the substring() function (who knew this function even existed?):
RIGHT = function(x,n){
substring(x,nchar(x)-n+1)
}
This works because substring() is essentially substr() underneath, but with a default end value (last) of 1,000,000.
Examples:
> RIGHT('Hello World!',2)
[1] "d!"
> RIGHT('Hello World!',8)
[1] "o World!"
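Because substring() recycles its arguments, the same RIGHT() definition also accepts a vector of strings or a vector of lengths. A short sketch, repeating the function so the block runs on its own:

```r
RIGHT <- function(x, n) {
  substring(x, nchar(x) - n + 1)
}

# one string, several lengths
several_n <- RIGHT("Hello World!", 1:3)

# several strings, one length
several_x <- RIGHT(c("12345", "ABCDE"), 2)

print(several_n)  # [1] "!"   "d!"  "ld!"
print(several_x)  # [1] "45" "DE"
```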
Try this:
x <- "some text in a string"
n <- 5
substr(x, nchar(x)-n, nchar(x))
It should give (six characters, because both endpoints are included):
[1] "string"
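Note that with this start position, n is one less than the number of characters returned. A quick sketch comparing it with the `+ 1` form used in the other answers (variable names are illustrative):

```r
x <- "some text in a string"
n <- 5

# start at nchar(x) - n: both endpoints are included, so n + 1 characters
n_plus_one <- substr(x, nchar(x) - n, nchar(x))

# start at nchar(x) - n + 1: exactly n characters
exactly_n <- substr(x, nchar(x) - n + 1, nchar(x))

print(n_plus_one)         # [1] "string"
print(nchar(n_plus_one))  # [1] 6
print(exactly_n)          # [1] "tring"
```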
An alternative to substr is to split the string into a list of single characters and process that:
N <- 2
sapply(strsplit(x, ""), function(x, n) paste(tail(x, n), collapse = ""), N)
I use a different approach, based on strsplit and tail rather than substr. I want to extract the last 6 characters of "Give me your food." Here are the steps:
(1) Split the characters
splits <- strsplit("Give me your food.", split = "")
(2) Extract the last 6 characters
tail(splits[[1]], n=6)
Output:
[1] " " "f" "o" "o" "d" "."
Each of these characters can be accessed by storing the result, e.g. last6 <- tail(splits[[1]], n=6), and indexing it as last6[i], where i is 1 to 6.
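A short sketch of the two steps, plus a final paste() to turn the characters back into one string. The name last6 is only illustrative.

```r
# (1) split into single characters
splits <- strsplit("Give me your food.", split = "")

# (2) keep the last 6 characters as a character vector
last6 <- tail(splits[[1]], n = 6)

# individual characters are indexed from 1 to 6
second <- last6[2]

# collapse the pieces back into a single string
joined <- paste(last6, collapse = "")

print(second)  # [1] "f"
print(joined)  # [1] " food."
```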
Someone above uses a similar solution to mine, but I find it easier to think of it as below:
> text<-"some text in a string" # we want only the last word, "string", which has 6 letters
> n<-5 # the character at position nchar(text) is itself included, so we subtract 1 from 6
> substr(x=text,start=nchar(text)-n,stop=nchar(text))
This returns the last characters as desired.
For those coming from Microsoft Excel or Google Sheets, you will have seen functions like LEFT(), RIGHT(), and MID(). I have created a package called forstringr, and its development version is currently on GitHub.
if(!require("devtools")){
install.packages("devtools")
}
devtools::install_github("gbganalyst/forstringr")
library(forstringr)
str_left(): counts from the left and extracts n characters
str_right(): counts from the right and extracts n characters
str_mid(): extracts characters from the middle
Examples:
x <- "some text in a string"
str_left(x, 4)
[1] "some"
str_right(x, 6)
[1] "string"
str_mid(x, 6, 4)
[1] "text"
I used the following code to get the last character of a string.
substr(stringOfInterest, nchar(stringOfInterest), nchar(stringOfInterest))
You can adjust the start position, nchar(stringOfInterest), to get the last few characters instead.
A small modification of @Andrie's solution also gives the complement:
substrR <- function(x, n) {
if(n > 0) substr(x, (nchar(x)-n+1), nchar(x)) else substr(x, 1, (nchar(x)+n))
}
x <- "moSvmC20F.5.rda"
substrR(x,-4)
[1] "moSvmC20F.5"
That was what I was looking for. The same idea carries over to the left side:
substrL <- function(x, n){
if(n > 0) substr(x, 1, n) else substr(x, -n+1, nchar(x))
}
substrL(substrR(x,-4),-2)
[1] "SvmC20F.5"
Just in case if a range of characters need to be picked:
# For example, to get the date part from the string
substrRightRange <- function(x, m, n){substr(x, nchar(x)-m+1, nchar(x)-m+n)}
value <- "REGNDATE:20170526RN"
substrRightRange(value, 10, 8)
[1] "20170526"

Regex to add comma between any character

I'm relatively new to regex, so bear with me if the question is trivial. I'd like to place a comma between every letter of a string using regex, e.g.:
x <- "ABCD"
I want to get
"A,B,C,D"
It would be nice if I could do that using gsub, sub or related on a vector of strings of arbitrary number of characters.
I tried
> sub("(\\w)", "\\1,", x)
[1] "A,BCD"
> gsub("(\\w)", "\\1,", x)
[1] "A,B,C,D,"
> gsub("(\\w)(\\w{1})$", "\\1,\\2", x)
[1] "ABC,D"
Try:
x <- 'ABCD'
gsub('\\B', ',', x, perl = T)
Prints:
[1] "A,B,C,D"
I might have misread the query; the OP is looking to add commas between letters only. Therefore try:
gsub('(\\p{L})(?=\\p{L})', '\\1,', x, perl = T)
(\p{L}) - Match any kind of letter from any language in a 1st group;
(?=\p{L}) - Positive lookahead to match as per above.
We can use the backreference to this capture group in the replacement.
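To illustrate the "letters only" behaviour, here is a small sketch on an input that mixes letters and digits. The second test string is an assumption added for illustration.

```r
x <- c("ABCD", "AB12CD")

# a comma is inserted only after a letter that is directly followed by a letter
letters_only <- gsub("(\\p{L})(?=\\p{L})", "\\1,", x, perl = TRUE)

print(letters_only)  # [1] "A,B,C,D"  "A,B12C,D"
```

The digits in the second string are left untouched, because neither "B1" nor "2C" is a letter followed by a letter.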
You can use
> gsub("(.)(?=.)", "\\1,", x, perl=TRUE)
[1] "A,B,C,D"
The (.)(?=.) regex matches any char, capturing it into Group 1 (with (.)), that must be followed by any single char ((?=.) is a positive lookahead that requires a char immediately to the right of the current location).
Variations of the solution:
> gsub("(.)(?!$)", "\\1,", x, perl=TRUE)
## Or with stringr:
## stringr::str_replace_all(x, "(.)(?!$)", "\\1,")
[1] "A,B,C,D"
Here, (?!$) is a negative lookahead that fails the match if the end of the string is immediately to the right of the current location.
See the R demo online:
x <- "ABCD"
gsub("(.)(?=.)", "\\1,", x, perl=TRUE)
# => [1] "A,B,C,D"
gsub("(.)(?!$)", "\\1,", x, perl=TRUE)
# => [1] "A,B,C,D"
stringr::str_replace_all(x, "(.)(?!$)", "\\1,")
# => [1] "A,B,C,D"
A regex-free answer:
paste(strsplit(x, "")[[1]], collapse = ",")
#[1] "A,B,C,D"
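The one-liner above handles a single string because of the [[1]]. Since the question asks for a vector of strings, a possible sketch is to apply paste() to each element of the strsplit() result. The input vector is made up for illustration.

```r
x <- c("ABCD", "XY", "hello")

# strsplit() returns one character vector per input string;
# vapply() pastes each of them back together with commas
with_commas <- vapply(strsplit(x, ""), paste, character(1), collapse = ",")

print(with_commas)  # [1] "A,B,C,D"   "X,Y"       "h,e,l,l,o"
```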
Another option is to use positive look behind and look ahead to assert there is a preceding and a following character:
library(stringr)
str_replace_all(x, "(?<=.)(?=.)", ",")
[1] "A,B,C,D"

Split string in parts by minus and plus in R

I want to split this string:
test = "-1x^2+3x^3-x^8+1-x"
...into parts by plus and minus characters in R. My goal would be to get:
"-1x^2" "+3x^3" "-x^8" "+1" "-x"
This didn't work:
strsplit(test, split = "-")
strsplit(test, split = "+")
We can provide a regular expression in strsplit, where we use ?= to lookahead to find the plus or minus sign, then split on that character. This will allow for the character itself to be retained rather than being dropped in the split.
strsplit(test, "(?<=.)(?=[+])|(?<=.)(?=[-])", perl = TRUE)[[1]]
# [1] "-1x^2" "+3x^3" "-x^8" "+1" "-x"
Try
> strsplit(test, split = "(?<=.)(?=[+-])", perl = TRUE)[[1]]
[1] "-1x^2" "+3x^3" "-x^8" "+1" "-x"
where (?<=.)(?=[+-]) matches the zero-width split position that is directly in front of a + or - and has at least one character before it.
This uses gsub to search for any character followed by + or - and inserts a semicolon between the two characters. Then it splits on semicolon.
s <- "-1x^2+3x^3-x^8+1-x"
strsplit(gsub("(.)([+-])", "\\1;\\2", s), ";")[[1]]
## [1] "-1x^2" "+3x^3" "-x^8" "+1" "-x"
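A different technique, not taken from the answers above, is to extract the terms instead of splitting: match an optional sign followed by a run of characters that are neither + nor -. A minimal sketch using base R's gregexpr() and regmatches():

```r
test <- "-1x^2+3x^3-x^8+1-x"

# [+-]?   an optional leading sign
# [^+-]+  one or more characters that are not a sign
terms <- regmatches(test, gregexpr("[+-]?[^+-]+", test))[[1]]

print(terms)  # [1] "-1x^2" "+3x^3" "-x^8"  "+1"    "-x"
```

This avoids lookarounds entirely, so perl = TRUE is not needed.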
In your examples, you use strsplit with a plus and a minus sign which will split on every encounter.
You could assert that what is directly to the left is not either the start of the string or + or -, while asserting + and - directly to the right.
(?<!^|[+-])(?=[+-])
Explanation
(?<! Negative lookbehind assertion
^ Start of string
| Or
[+-] Match either + or - using a character class
) Close lookbehind
(?= Positive lookahead assertion
[+-] Match either + or -
) Close lookahead
As the pattern uses lookaround assertions, you have to use perl = T to use a perl style regex.
Example
test <- "-1x^2+3x^3-x^8+1-x"
strsplit(test, split = "(?<!^|[+-])(?=[+-])", perl = T)
Output
[[1]]
[1] "-1x^2" "+3x^3" "-x^8" "+1" "-x"
See an online R demo.
If there also cannot be a whitespace character directly to the left, you can write the pattern as
(?<!^|[\\s+-])(?=[+-])
See a regex demo.

R: Drop all not matching letters of string vector

I have a string vector
d <- c("sladfj0923rn2", "ääas230ß0sadfn", "823Höl32basdflk")
I want to remove all characters from this vector that do not
match "a-z", "A-Z" and "'"
I tried to use
gsub("![a-zA-Z']", "", d)
but that doesn't work.
We could make your replacement pattern even tighter by doing a case-insensitive substitution:
d <- c("sladfj0923rn2", "ääas230ß0sadfn", "823Höl32basdflk")
gsub("[^a-z]", "", d, ignore.case=TRUE)
[1] "sladfjrn" "assadfn" "Hlbasdflk"
We can use the ^ inside the square brackets to match all characters except the ones specified within the brackets
gsub("[^a-zA-Z]", "", d)
#[1] "sladfjrn" "assadfn" "Hlbasdflk"
data
d <- c("sladfj0923rn2", "ääas230ß0sadfn", "823Höl32basdflk")
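The question also asks to keep the apostrophe, which the patterns above drop. A sketch that adds ' to the negated class; the input vector d2 is made up for illustration, since the original data contains no apostrophes.

```r
d2 <- c("it's 2024", "rock'n'roll!")

# [^a-zA-Z'] matches anything that is not an ASCII letter or an apostrophe
kept <- gsub("[^a-zA-Z']", "", d2)

print(kept)  # [1] "it's"        "rock'n'roll"
```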

r sub negation of [:digit:] in regex

I am trying to use sub to remove everything from the end of string s (the pattern always includes :, digits and parentheses) back to, but not including, the first digit before the opening parenthesis (.
s <- "NXF1F-Z10_(1:111)"
> sub("\\(1:[[:digit:]]+)$", "", s) # Almost works!
[1] "NXF1F-Z10_"
To also remove the characters that are not digits (like _, or any run of non-digit characters of any length), I tried in vain to negate digits like this:
sub("[^[:digit:]]*(1:[[:digit:]]+)$", "", s)
The desired output is :
[1] "NXF1F-Z10"
s <- "NXF1F-Z10_(1:111)"
Try this
sub("_.+", "", s)
# "NXF1F-Z10"
More general
sub("(\\d)[^\\d]*[(].*[)]$", "\\1", s, perl=TRUE)
# "NXF1F-Z10"
sub("(\\d)[^\\d]*[(].*[)]$", "\\1", t, perl=TRUE)
# "NXF1F-Z10"
Or this
sub("[(](\\d+):.+", "\\1", s)
# "NXF1F-Z10_1"
Depending on what you want
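For comparison, a sketch of why the attempt in the question leaves the string unchanged and how escaping the parentheses changes that. In the attempt, the unescaped ( opens a group instead of matching a literal parenthesis, so the pattern requires the string to end right after the digits. The variable names are illustrative.

```r
s <- "NXF1F-Z10_(1:111)"

# unescaped "(" starts a group, so this asks for "1:111" at the very end
# of the string; the trailing ")" in s prevents any match
attempt <- sub("[^[:digit:]]*(1:[[:digit:]]+)$", "", s)

# with both parentheses escaped, the run of non-digits before "(" is removed too
escaped <- sub("[^[:digit:]]*\\(1:[[:digit:]]+\\)$", "", s)

print(attempt)  # [1] "NXF1F-Z10_(1:111)"
print(escaped)  # [1] "NXF1F-Z10"
```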

Resources