I have the following string in R
A<-"A (23) 56 hh()"
I want to get the following output
"A (23) 56 hh"
I tried the following code
B<-gsub(pattern = "()", replacement = "", x = A)
That didn't yield the desired result. How can I accomplish this?
Try fixed = TRUE in gsub
> gsub("()", "", A, fixed = TRUE)
[1] "A (23) 56 hh"
Using str_remove
library(stringr)
str_remove_all(A, fixed("()"))
-output
[1] "A (23) 56 hh"
Try B<-gsub(pattern = "\\(\\)", replacement = "", x = A)
The \\ escapes each parenthesis, so it is matched as a literal character rather than treated as regex grouping syntax.
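To make the difference concrete, here is a minimal base R sketch comparing the three variants discussed so far. The variable names (unescaped, escaped, literal) are only illustrative.

```r
A <- "A (23) 56 hh()"

# "()" as a regex is an empty capture group: it matches the empty string,
# so replacing it with "" leaves the input unchanged.
unescaped <- gsub("()", "", A)

# Escaping both parentheses makes them literal characters.
escaped <- gsub("\\(\\)", "", A)

# fixed = TRUE switches off regex interpretation entirely.
literal <- gsub("()", "", A, fixed = TRUE)

print(unescaped)  # "A (23) 56 hh()"
print(escaped)    # "A (23) 56 hh"
print(literal)    # "A (23) 56 hh"
```

Both the escaped and the fixed variants give the desired result; fixed = TRUE is the simpler choice when no regex features are needed.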
dy_by and ThomasIsCoding have good answers. Here is a modification of dy_by's answer
gsub(pattern = "\\()", replacement = "", x = A)
[1] "A (23) 56 hh"
Another option is to remove two consecutive parenthesis characters with a character class, which obviates the need for fixed = TRUE (note that this pattern would also match "((", "))" or ")("):
library(stringr)
A %>% str_remove("[()]{2}")
[1] "A (23) 56 hh"
Related
I have the following sample dataset:
XYZ 185g
ABC 60G
Gha 20g
How do I remove the strings "185g", "60G", "20g" without accidentally removing the alphabets g and G in the main words?
I tried the below code but it replaces the alphabets in the main words as well.
a <- str_replace_all(a$words,"[0-9]"," ")
a <- str_replace_all(a$words,"[gG]"," ")
You need to combine them into something like
a$words <- str_replace_all(a$words,"\\s*\\d+[gG]$", "")
The \s*\d+[gG]$ regex matches
\s* - zero or more whitespaces
\d+ - one or more digits
[gG] - g or G
$ - end of string.
If you can have these strings inside a string, not just at the end, you may use
a$words <- str_replace_all(a$words,"\\s*\\d+[gG]\\b", "")
where $ is replaced with a \b, a word boundary.
To ignore case,
a$words <- str_replace_all(a$words, regex("\\s*\\d+g\\b", ignore_case=TRUE), "")
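As a rough illustration of the difference between the $ and \b variants, here is a base R sketch using gsub instead of stringr. The vector words and its fourth element are made up for the example; perl = TRUE is used so that \b is handled by the PCRE engine.

```r
# Hypothetical sample data; the last element has the weight in the middle.
words <- c("XYZ 185g", "ABC 60G", "Gha 20g", "Mix 20g pack")

# Anchored with $: only removes the weight at the end of the string.
at_end <- gsub("\\s*\\d+[gG]$", "", words, perl = TRUE)

# Anchored with \b: removes the weight wherever it ends at a word boundary.
anywhere <- gsub("\\s*\\d+[gG]\\b", "", words, perl = TRUE)

# Same as above, but case-insensitive instead of listing both g and G.
anywhere_ci <- gsub("\\s*\\d+g\\b", "", words, perl = TRUE, ignore.case = TRUE)

print(at_end)    # "XYZ" "ABC" "Gha" "Mix 20g pack"
print(anywhere)  # "XYZ" "ABC" "Gha" "Mix pack"
```

Because the pattern requires digits before the g, the letters in "Gha" are never touched.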
You can try
> gsub("\\s\\d+g$", "", c("XYZ 185g", "ABC 60G", "Gha 20g"), ignore.case = TRUE)
[1] "XYZ" "ABC" "Gha"
You can also use the following solution:
vec <- c("XYZ 185g", "ABC 60G", "Gha 20g")
gsub("[A-Za-z]+(*SKIP)(*FAIL)|[ 0-9Gg]+", "", vec, perl = TRUE)
[1] "XYZ" "ABC" "Gha"
I am trying to count the number of dots in a character string.
I have tried to use str_count but it gives me the number of letters of the string instead.
ex_str <- "This.is.a.string"
str_count(ex_str, '.')
nchar(ex_str)
. is a special regex symbol, so you need to escape it:
str_count(ex_str, '\\.')
# [1] 3
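For anyone without stringr, the same effect can be reproduced in base R. This is only a sketch; count_matches is a hypothetical helper written for this example, not a function from any package.

```r
ex_str <- "This.is.a.string"

# Hypothetical helper: count pattern matches in each element of x.
count_matches <- function(pattern, x, fixed = FALSE) {
  lengths(regmatches(x, gregexpr(pattern, x, fixed = fixed)))
}

any_char <- count_matches(".", ex_str)                # "." matches every character
escaped  <- count_matches("\\.", ex_str)              # escaped: a literal dot
in_class <- count_matches("[.]", ex_str)              # inside [] a dot is literal
literal  <- count_matches(".", ex_str, fixed = TRUE)  # no regex at all

print(any_char)  # 16, same as nchar(ex_str)
print(escaped)   # 3
```

This also shows why the unescaped version returned the number of letters: the dot matched each of the 16 characters once.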
Using just base R you could do:
nchar(gsub("[^.]", "", ex_str))
Using stringi:
library(stringi)
stri_count_fixed(ex_str, '.')
Another base R solution could be:
length(grepRaw(".", ex_str, fixed = TRUE, all = TRUE))
[1] 3
You may also use the base function gregexpr:
sum(gregexpr(".", ex_str, fixed=TRUE)[[1]] > 0)
[1] 3
You can use stringr::str_count with a fixed(...) argument to avoid treating it as a regular expression:
str_count(ex_str, fixed('.'))
See the online R demo:
library(stringr)
ex_str <- "This.is.a.string"
str_count(ex_str, fixed('.'))
## => [1] 3
I want to remove or replace brackets "(" or ")" from my string using gsub. However as shown below it is not working. What could be the reason?
> k<-"(abc)"
> t<-gsub("()","",k)
> t
[1] "(abc)"
Using the correct regex works:
gsub("[()]", "", "(abc)")
The additional square brackets mean "match any of the characters inside".
A safe and simple solution that doesn't rely on regex:
k <- gsub("(", "", k, fixed = TRUE) # fixed = TRUE disables regex
k <- gsub(")", "", k, fixed = TRUE)
k
[1] "abc"
One possible way, along the lines of what the OP is trying, is:
gsub("\\(|)","","(abc)")
#[1] "abc"
`\(` => look for the `(` character. The `\` is needed because `(` is a special character.
`|` => OR condition
`)` => look for the `)` character
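A quick sketch comparing the character-class and alternation patterns on a few made-up inputs (the vector k_demo is only for illustration); both should give the same result.

```r
# Hypothetical test inputs.
k_demo <- c("(abc)", "f(x) + g(y)", "no brackets")

# Character class: matches either "(" or ")".
char_class <- gsub("[()]", "", k_demo)

# Alternation with both parentheses escaped.
alternation <- gsub("\\(|\\)", "", k_demo)

print(char_class)   # "abc" "fx + gy" "no brackets"
print(alternation)  # "abc" "fx + gy" "no brackets"
```

The character class is usually easier to read, since no escaping is needed inside the square brackets.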
I am struggling to remove the substring before the underscore in my string.
I want to use * (wildcard) as the bit before the underscore can vary:
a <- c("foo_5", "bar_7")
a <- gsub("*_", "", a, perl = TRUE)
The result should look like:
> a
[1] 5 7
I also tried stuff like "^*" or "?" but they did not really work.
The following code works on your example :
gsub(".*_", "", a)
Alternatively, you can also try:
gsub("\\S+_", "", a)
Just to point out that there is an approach using functions from the tidyverse, which I find more readable than gsub:
a %>% stringr::str_remove(pattern = ".*_")
as.numeric(gsub(pattern = ".*_", replacement = '', a))
[1] 5 7
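One thing worth knowing about ".*_" is that it is greedy, which matters when a string contains more than one underscore. A small sketch, where the third element of a_demo is an invented example:

```r
# Hypothetical input with one string containing two underscores.
a_demo <- c("foo_5", "bar_7", "baz_qux_9")

# Greedy: ".*_" consumes everything up to the LAST underscore.
after_last <- gsub(".*_", "", a_demo)

# "[^_]*" cannot cross an underscore, so this stops at the FIRST one.
after_first <- sub("^[^_]*_", "", a_demo)

print(after_last)   # "5" "7" "9"
print(after_first)  # "5" "7" "qux_9"
```

For the original data both versions agree; pick whichever matches the intent if the prefix can itself contain underscores.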
So " xx yy 11 22 33 " will become "xxyy112233". How can I achieve this?
In general, we want a solution that is vectorised, so here's a better test example:
whitespace <- " \t\n\r\v\f" # space, tab, newline,
# carriage return, vertical tab, form feed
x <- c(
" x y ", # spaces before, after and in between
" \u2190 \u2192 ", # contains unicode chars
paste0( # varied whitespace
whitespace,
"x",
whitespace,
"y",
whitespace,
collapse = ""
),
NA # missing
)
## [1] " x y "
## [2] " ← → "
## [3] " \t\n\r\v\fx \t\n\r\v\fy \t\n\r\v\f"
## [4] NA
The base R approach: gsub
gsub replaces all instances of a string (fixed = TRUE) or regular expression (fixed = FALSE, the default) with another string. To remove all spaces, use:
gsub(" ", "", x, fixed = TRUE)
## [1] "xy" "←→"
## [3] "\t\n\r\v\fx\t\n\r\v\fy\t\n\r\v\f" NA
As DWin noted, in this case fixed = TRUE isn't necessary but provides slightly better performance since matching a fixed string is faster than matching a regular expression.
If you want to remove all types of whitespace, use:
gsub("[[:space:]]", "", x) # note the double square brackets
## [1] "xy" "←→" "xy" NA
gsub("\\s", "", x) # same; note the double backslash
library(regex)
gsub(space(), "", x) # same
"[:space:]" is a POSIX character class matching all whitespace characters; it must be used inside a bracket expression, hence the double square brackets. \s is a shorthand supported by most regular-expression flavors that does the same thing.
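To see the difference between removing literal spaces and removing all whitespace, here is a small self-contained sketch. The vector x_demo is a shortened stand-in for the test vector above, not the same data.

```r
# Hypothetical input: spaces, a tab, a newline, plus a missing value.
x_demo <- c(" x \t y \n", NA)

# Fixed string: removes only the space character.
spaces_only <- gsub(" ", "", x_demo, fixed = TRUE)

# POSIX class and \s shorthand: remove every kind of whitespace.
posix_class <- gsub("[[:space:]]", "", x_demo)
shorthand   <- gsub("\\s", "", x_demo)

print(spaces_only)  # "x\ty\n" NA
print(posix_class)  # "xy" NA
```

Note that NA stays NA in every case, so the approach is safe on vectors with missing values.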
The stringr approach: str_replace_all and str_trim
stringr provides more human-readable wrappers around the base R functions (though as of Dec 2014, the development version has a branch built on top of stringi, mentioned below). The equivalents of the above commands, using str_replace_all, are:
library(stringr)
str_replace_all(x, fixed(" "), "")
str_replace_all(x, space(), "")
stringr also has a str_trim function which removes only leading and trailing whitespace.
str_trim(x)
## [1] "x y" "← →" "x \t\n\r\v\fy" NA
str_trim(x, "left")
## [1] "x y " "← → "
## [3] "x \t\n\r\v\fy \t\n\r\v\f" NA
str_trim(x, "right")
## [1] " x y" " ← →"
## [3] " \t\n\r\v\fx \t\n\r\v\fy" NA
The stringi approach: stri_replace_all_charclass and stri_trim
stringi is built upon the platform-independent ICU library, and has an extensive set of string manipulation functions. The equivalents of the above are:
library(stringi)
stri_replace_all_fixed(x, " ", "")
stri_replace_all_charclass(x, "\\p{WHITE_SPACE}", "")
Here "\\p{WHITE_SPACE}" is an alternate syntax for the set of Unicode code points considered to be whitespace, equivalent to "[[:space:]]", "\\s" and space(). For more complex regular expression replacements, there is also stri_replace_all_regex.
stringi also has trim functions.
stri_trim(x)
stri_trim_both(x) # same
stri_trim(x, "left")
stri_trim_left(x) # same
stri_trim(x, "right")
stri_trim_right(x) # same
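If neither package is available, base R's trimws() covers the same trimming cases through its which argument. A minimal sketch with a made-up input string:

```r
# Hypothetical input: two leading spaces, a trailing space and tab.
padded <- "  x y \t"

both  <- trimws(padded)                   # strip both ends (the default)
left  <- trimws(padded, which = "left")   # strip leading whitespace only
right <- trimws(padded, which = "right")  # strip trailing whitespace only

print(both)   # "x y"
print(left)   # "x y \t"
print(right)  # "  x y"
```

Like the stringr and stringi trim functions, this leaves whitespace in the middle of the string alone.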
I just learned about the "stringr" package to remove white space from the beginning and end of a string with str_trim( , side="both") but it also has a replacement function so that:
a <- " xx yy 11 22 33 "
str_replace_all(string = a, pattern = " ", replacement = "")
[1] "xxyy112233"
x = "xx yy 11 22 33"
gsub(" ", "", x)
[1] "xxyy112233"
Use [[:blank:]] to match horizontal whitespace characters (spaces and tabs).
gsub("[[:blank:]]", "", " xx yy 11 22 33 ")
# [1] "xxyy112233"
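As a small sketch of what [[:blank:]] does and does not cover, compare it with [[:space:]] on an invented string containing a tab and a newline:

```r
# Hypothetical input with a space, a tab and a newline.
mixed <- " xx\tyy\n11 "

# [[:blank:]] removes spaces and tabs but keeps the newline.
no_blank <- gsub("[[:blank:]]", "", mixed)

# [[:space:]] removes the newline as well.
no_space <- gsub("[[:space:]]", "", mixed)

print(no_blank)  # "xxyy\n11"
print(no_space)  # "xxyy11"
```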
Please note that the solutions written above remove only spaces. If you also want to remove tabs or newlines, use stri_replace_all_charclass from the stringi package.
library(stringi)
stri_replace_all_charclass(" ala \t ma \n kota ", "\\p{WHITE_SPACE}", "")
## [1] "alamakota"
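The function str_squish() from the stringr package (part of the tidyverse) trims leading and trailing whitespace and collapses internal runs of whitespace into a single space: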
library(dplyr)
library(stringr)
df <- data.frame(a = c(" aZe aze s", "wxc s aze "),
b = c(" 12 12 ", "34e e4 "),
stringsAsFactors = FALSE)
df <- df %>%
rowwise() %>%
mutate(across(everything(), str_squish)) %>% # funs() is deprecated; across() is the current idiom
ungroup()
df
# A tibble: 2 x 2
a b
<chr> <chr>
1 aZe aze s 12 12
2 wxc s aze 34e e4
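For comparison, roughly the same behaviour can be sketched in base R without any packages. squish_base is a hypothetical helper name, and the inputs are invented to include doubled spaces.

```r
# Hypothetical helper approximating str_squish() in base R:
# collapse each run of whitespace to one space, then trim both ends.
squish_base <- function(x) {
  trimws(gsub("\\s+", " ", x))
}

# Made-up inputs with leading, trailing and repeated spaces.
raw <- c(" aZe  aze s", "wxc s  aze ", NA)
squished <- squish_base(raw)

print(squished)  # "aZe aze s" "wxc s aze" NA
```

Note that this keeps single spaces between words, so it answers a slightly different question than removing all whitespace.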
Another approach can be taken into account
library(stringr)
str_replace_all(" xx yy 11 22 33 ", regex("\\s*"), "")
#[1] "xxyy112233"
\\s: Matches Space, tab, vertical tab, newline, form feed, carriage return
*: Matches at least 0 times
income<-c("$98,000.00 ", "$90,000.00 ", "$18,000.00 ", "")
To remove space after .00 use the trimws() function.
income<-trimws(income)
From stringr library you could try this:
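1. Trim the leading and trailing blanks with str_trim
2. Remove the remaining blanks with str_replace_all
library(stringr)
# str_trim (step 1) runs first; str_replace_all (step 2) then works on its result
str_replace_all(str_trim(" xx yy 11 22 33 "), " ", "")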