How to remove all whitespace from a string? - r

So " xx yy 11 22 33 " will become "xxyy112233". How can I achieve this?

In general, we want a solution that is vectorised, so here's a better test example:
whitespace <- " \t\n\r\v\f" # space, tab, newline,
# carriage return, vertical tab, form feed
x <- c(
" x y ", # spaces before, after and in between
" \u2190 \u2192 ", # contains unicode chars
paste0( # varied whitespace
whitespace,
"x",
whitespace,
"y",
whitespace,
collapse = ""
),
NA # missing
)
## [1] " x y "
## [2] " ← → "
## [3] " \t\n\r\v\fx \t\n\r\v\fy \t\n\r\v\f"
## [4] NA
The base R approach: gsub
gsub replaces all instances of a string (fixed = TRUE) or regular expression (fixed = FALSE, the default) with another string. To remove all spaces, use:
gsub(" ", "", x, fixed = TRUE)
## [1] "xy" "←→"
## [3] "\t\n\r\v\fx\t\n\r\v\fy\t\n\r\v\f" NA
As DWin noted, in this case fixed = TRUE isn't necessary but provides slightly better performance since matching a fixed string is faster than matching a regular expression.
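Beyond speed, fixed = TRUE also changes what the pattern means. The following is a minimal sketch using a made-up string (the variable names are illustrative only) in which the pattern happens to be a regex metacharacter:

```r
# Hypothetical input: the pattern "." is a regex metacharacter.
s <- "a.b.c"

# As a regular expression, "." matches every character, so everything goes.
as_regex <- gsub(".", "", s)

# As a fixed string, "." matches only literal dots.
as_fixed <- gsub(".", "", s, fixed = TRUE)

print(as_regex)
print(as_fixed)
```

For a plain space the two calls agree, which is why fixed = TRUE is optional when removing spaces.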
If you want to remove all types of whitespace, use:
gsub("[[:space:]]", "", x) # note the double square brackets
## [1] "xy" "←→" "xy" NA
gsub("\\s", "", x) # same; note the double backslash
library(regex)
gsub(space(), "", x) # same
"[[:space:]]" combines a bracket expression with the POSIX character class [:space:], which matches all whitespace characters; the class is not specific to R. \s is a shorthand, supported by R and most other regex engines, that does the same thing.
The stringr approach: str_replace_all and str_trim
stringr provides more human-readable wrappers around the base R functions (though as of Dec 2014, the development version has a branch built on top of stringi, mentioned below). The equivalents of the above commands, using str_replace_all, are:
library(stringr)
str_replace_all(x, fixed(" "), "")
str_replace_all(x, space(), "")
stringr also has a str_trim function which removes only leading and trailing whitespace.
str_trim(x)
## [1] "x y" "← →" "x \t\n\r\v\fy" NA
str_trim(x, "left")
## [1] "x y " "← → "
## [3] "x \t\n\r\v\fy \t\n\r\v\f" NA
str_trim(x, "right")
## [1] " x y" " ← →"
## [3] " \t\n\r\v\fx \t\n\r\v\fy" NA
The stringi approach: stri_replace_all_charclass and stri_trim
stringi is built upon the platform-independent ICU library, and has an extensive set of string manipulation functions. The equivalents of the above are:
library(stringi)
stri_replace_all_fixed(x, " ", "")
stri_replace_all_charclass(x, "\\p{WHITE_SPACE}", "")
Here "\\p{WHITE_SPACE}" is an alternate syntax for the set of Unicode code points considered to be whitespace, equivalent to "[[:space:]]", "\\s" and space(). For more complex regular expression replacements, there is also stri_replace_all_regex.
stringi also has trim functions.
stri_trim(x)
stri_trim_both(x) # same
stri_trim(x, "left")
stri_trim_left(x) # same
stri_trim(x, "right")
stri_trim_right(x) # same

The "stringr" package can remove whitespace from the beginning and end of a string with str_trim( , side="both"), and it also has a replacement function:
a <- " xx yy 11 22 33 "
str_replace_all(string = a, pattern = " ", replacement = "")
[1] "xxyy112233"

x = "xx yy 11 22 33"
gsub(" ", "", x)
> [1] "xxyy112233"

Use [[:blank:]] to match any kind of horizontal whitespace character (spaces and tabs).
gsub("[[:blank:]]", "", " xx yy 11 22 33 ")
# [1] "xxyy112233"
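To make the difference between the character classes concrete, here is a small sketch on a made-up string that contains a space, a tab and a newline (the input and variable names are illustrative only):

```r
# Hypothetical input containing spaces, a tab and a newline.
s <- " a\tb\nc "

only_spaces <- gsub(" ", "", s, fixed = TRUE)  # removes spaces only
horizontal  <- gsub("[[:blank:]]", "", s)      # removes spaces and tabs
all_ws      <- gsub("[[:space:]]", "", s)      # removes newlines as well

print(c(only_spaces, horizontal, all_ws))
```

Only the [[:space:]] version leaves no whitespace at all, so [[:blank:]] is the right choice only when line breaks should be kept.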

Please note that the solutions above remove only spaces. If you also want to remove tabs or newlines, use stri_replace_all_charclass from the stringi package.
library(stringi)
stri_replace_all_charclass(" ala \t ma \n kota ", "\\p{WHITE_SPACE}", "")
## [1] "alamakota"

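The function str_squish() from the stringr package (part of the tidyverse) trims leading and trailing whitespace and collapses each internal run of whitespace into a single space. Note that it keeps one space between words rather than removing all whitespace.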
library(dplyr)
library(stringr)
df <- data.frame(a = c(" aZe aze s", "wxc s aze "),
b = c(" 12 12 ", "34e e4 "),
stringsAsFactors = FALSE)
df <- df %>%
  as_tibble() %>%
  mutate(across(everything(), str_squish)) # replaces the deprecated mutate_all(funs(...))
df
# A tibble: 2 x 2
  a         b
  <chr>     <chr>
1 aZe aze s 12 12
2 wxc s aze 34e e4
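For comparison, roughly the same "squish" behaviour can be sketched in base R. The helper name squish is hypothetical and the data frame is a made-up variant of the one above; this is an approximation of what str_squish() does, not its actual implementation:

```r
# Hypothetical helper: collapse runs of whitespace to one space, then trim the ends.
squish <- function(x) {
  trimws(gsub("[[:space:]]+", " ", x))
}

df <- data.frame(a = c(" aZe  aze s", "wxc s   aze "),
                 b = c(" 12    12 ", "34e e4  "),
                 stringsAsFactors = FALSE)

# Apply the helper to every column while keeping the data frame structure.
df[] <- lapply(df, squish)
print(df)
```

To remove the whitespace entirely instead, replace the inner call with gsub("[[:space:]]+", "", x).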

Another approach uses str_replace_all with a regular expression:
library(stringr)
str_replace_all(" xx yy 11 22 33 ", regex("\\s*"), "")
#[1] "xxyy112233"
\\s: matches a space, tab, vertical tab, newline, form feed or carriage return
*: matches zero or more times (\\s+, one or more times, also works here and avoids empty matches)

income<-c("$98,000.00 ", "$90,000.00 ", "$18,000.00 ", "")
To remove the space after .00, use the trimws() function, which strips leading and trailing whitespace.
income<-trimws(income)
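Since trimws() only touches the ends of each string, its which argument can restrict the trimming to one side. A short sketch with made-up values:

```r
# Hypothetical values with whitespace on different sides.
income <- c("$98,000.00 ", " $90,000.00", "")

right_only <- trimws(income, which = "right")  # strip trailing whitespace only
both_sides <- trimws(income)                   # default: which = "both"

print(right_only)
print(both_sides)
```

Whitespace inside a string is left alone in both cases, so trimws() is not a substitute for gsub() when all whitespace has to go.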

From the stringr library you could try this:
1. Remove the leading and trailing blanks (str_trim)
2. Remove the remaining blanks (str_replace_all)
library(stringr)
# step 2 wraps step 1
str_replace_all(str_trim(" xx yy 11 22 33 "), " ", "")

Related

Parsing text in r without separator

I need help with ideas for parsing this text.
I want to do it in the most automatic way possible.
This is the text
text <- "JOHN DEERE: PMWF2126 NEW HOLLAND: 441702A1 HIFI: WE 2126 CUMMINS: 4907485"
I need this result:
a            b
JOHN DEERE   PMWF2126
NEW HOLLAND  441702A1
HIFI         WE 2126
CUMMINS      4907485
This is just an example; there are different brands and item IDs.
I tried:
str_split(text, " ")
[[1]]
[1] "JOHN" "DEERE:" "PMWF2126" "NEW" "HOLLAND:" "441702A1" "HIFI:" "WE" "2126"
[10] "CUMMINS:" "4907485" "CUMMINS:" "3680433" "CUMMINS:" "3680315" "CUMMINS:" "3100310"
Thanks!
Edit:
Thanks for your answers, very helpful.
But there is another case, where the value can end with a letter too:
text <- "LANSS: EF903R DARMET: VP-2726/S CASE: 133721A1 JOHN DEERE: RE68049 JCB: 32917302 WIX: 46490 TURBO: TR25902 HIFI: SA 16080 CATERPILLAR: 4431570 KOMATSU: Z7602BXK06 KOMATSU: Z7602BX106 KOMATSU: YM12991012501 KOMATSU: YM12991012500 KOMATSU: YM11900512571 KOMATSU: 6001851320 KOMATSU: 6001851300 KOMATSU: 3EB0234790 KOMATSU: 11900512571"
We can use separate_rows and separate from tidyr for this task:
library(tidyverse)
data.frame(text) %>%
# separate into rows:
separate_rows(text, sep = "(?<=\\d)\\s") %>%
# separate into columns:
separate(text,
into = c("a", "b"),
sep = ":\\s")
# A tibble: 4 × 2
  a           b
  <chr>       <chr>
1 JOHN DEERE  PMWF2126
2 NEW HOLLAND 441702A1
3 HIFI        WE 2126
4 CUMMINS     4907485
The split point for separate_rows uses the look-behind (?<=\\d) to assert that the whitespace \\s on which the string is broken must be preceded by a digit (\\d).
Data:
text <- "JOHN DEERE: PMWF2126 NEW HOLLAND: 441702A1 HIFI: WE 2126 CUMMINS: 4907485"
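The same look-behind can be tried in base R, which may help to see what each step produces. This is a sketch under the same assumption as the tidyr version (every value ends in a digit); the names rows, fields and result are illustrative:

```r
text <- "JOHN DEERE: PMWF2126 NEW HOLLAND: 441702A1 HIFI: WE 2126 CUMMINS: 4907485"

# Split on whitespace that is preceded by a digit (look-behind needs perl = TRUE).
rows <- strsplit(text, "(?<=\\d)\\s", perl = TRUE)[[1]]

# Split each row into its two fields at the colon and following whitespace.
fields <- do.call(rbind, strsplit(rows, ":\\s"))

result <- data.frame(a = fields[, 1], b = fields[, 2], stringsAsFactors = FALSE)
print(result)
```

Because the look-behind consumes nothing, the digit stays in the value and only the whitespace is dropped at each split point.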
The solution assumes (as in your sample data) that the second value always ends with a number and that the first column does not.
If this is not the case, you'll have to adapt the regex part (?<=[0-9] )(?=[A-Z]) so that the splitting point lies between the two round-bracketed parts.
text <- "JOHN DEERE: PMWF2126 NEW HOLLAND: 441702A1 HIFI: WE 2126 CUMMINS: 4907485"
lapply(
strsplit(
unlist(strsplit(text, "(?<=[0-9] )(?=[A-Z])", perl = TRUE)),
":"), trimws)
[[1]]
[1] "JOHN DEERE" "PMWF2126"
[[2]]
[1] "NEW HOLLAND" "441702A1"
[[3]]
[1] "HIFI" "WE 2126"
[[4]]
[1] "CUMMINS" "4907485"
The key part is strsplit(text, "(?<=[0-9] )(?=[A-Z])", perl = TRUE).
This looks for positions where a digit followed by a space, (?<=[0-9] ), is immediately followed by a new part starting with a capital letter, (?=[A-Z]).
These positions are then used as splitting points.
Since the second field always ends in a digit and the first field does not, replace a digit followed by space with that digit and a newline and then use read.table with a colon separator.
text |>
gsub("(\\d) ", "\\1\n", x = _) |>
read.table(text = _, sep = ":", strip.white = TRUE)
giving
V1 V2
1 JOHN DEERE PMWF2126
2 NEW HOLLAND 441702A1
3 HIFI WE 2126
4 CUMMINS 4907485
If the second field can contain a digit but the first cannot, and the digit may appear anywhere in the last word of field two rather than at its end, we can use this variation, which gives the same result here. gsubfn is like gsub except that the second argument can be a function instead of a replacement string; the function receives the match (or the capture groups, if the pattern has any) as input, and its output replaces the entire match. The function can be written in formula notation, as is done here.
library(gsubfn)
text |>
gsubfn("\\w+", ~ if (grepl("[0-9]", x)) paste(x, "\n") else x, x = _) |>
read.table(text = _, sep = ":", strip.white = TRUE)
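For values that end in a letter, as in the edited question, a base R variant of the same idea might look like the sketch below. It assumes that keys never contain digits and that only the last whitespace-delimited word of each value contains one; the function name parse_pairs and the shortened sample string are hypothetical:

```r
# Hypothetical helper: a word containing a digit is taken to end a record.
parse_pairs <- function(text) {
  # Put a newline after every whitespace-delimited word that contains a digit.
  marked <- gsub("(\\S*[0-9]\\S*)\\s*", "\\1\n", text, perl = TRUE)
  records <- strsplit(marked, "\n", fixed = TRUE)[[1]]
  # Everything before the first colon is the key, everything after it the value.
  data.frame(a = trimws(sub(":.*$", "", records)),
             b = trimws(sub("^[^:]*:", "", records)),
             stringsAsFactors = FALSE)
}

# Shortened sample taken from the edited question.
parsed <- parse_pairs("LANSS: EF903R DARMET: VP-2726/S CASE: 133721A1 HIFI: SA 16080")
print(parsed)
```

Matching on \\S* rather than \\w+ keeps values such as VP-2726/S in one piece, at the cost of relying on the stated assumption about where digits occur.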

How to drop specific characters in strings in a column?

How can I drop "-" or double "--" only at the beginning of the value in the text column?
df <- data.frame (x = c(12,14,15,178),
text = c("--Car","-Transport","Big-Truck","--Plane"))
x text
1 12 --Car
2 14 -Transport
3 15 Big-Truck
4 178 --Plane
Expected output:
x text
1 12 Car
2 14 Transport
3 15 Big-Truck
4 178 Plane
You can use gsub and the following regex "^\\-+". ^ states that the match should be at the beginning of the string, and that it should be 1 or more (+) hyphen (\\-).
gsub("^\\-+", "", df$text)
# [1] "Car" "Transport" "Big-Truck" "Plane"
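Two small simplifications are possible here, shown as a sketch on the same values: a hyphen is only special inside a bracket expression, so the backslashes are optional, and because the pattern is anchored with ^ it can match at most once per string, so sub is enough.

```r
text <- c("--Car", "-Transport", "Big-Truck", "--Plane")

# "^-+" matches one or more hyphens at the start of the string only.
cleaned <- sub("^-+", "", text)
print(cleaned)
```

The hyphen inside "Big-Truck" is untouched because it is not at the start of the string.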
If there may also be spaces at the beginning of the string and you want to remove them too, you can use [ -]+ in your regex. It matches a run of spaces and/or hyphens at the beginning of the string.
gsub("^[ -]+", "", df$text)
To apply this to the data frame, assign the result back to the column. In the tidyverse, you can also use str_remove:
df$text <- gsub("^\\-+", "", df$text)
# or, in dplyr
library(tidyverse)
df %>%
mutate(text1 = gsub("^\\-+", "", text),
text2 = str_remove(text, "^\\-+"))
You could use trimws to remove certain leading/trailing characters.
trimws(df$text, whitespace = '[ -]')
# [1] "Car" "Transport" "Big-Truck" "Plane"
# a more complex situation
x <- " -- Car - -"
trimws(x, whitespace = '[ -]')
# [1] "Car"
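Note that trimws() strips both ends by default, while the question asks only about the beginning of the value. A sketch with made-up values, assuming R 3.6.0 or later (where the whitespace argument is available):

```r
# Hypothetical values with hyphens and spaces on both sides.
x <- c("--Car-", " -- Plane - -", "Big-Truck")

both_ends    <- trimws(x, whitespace = "[ -]")                  # default: both sides
leading_only <- trimws(x, which = "left", whitespace = "[ -]")  # start of string only

print(both_ends)
print(leading_only)
```

With which = "left" any trailing hyphens are preserved, which matches the behaviour of the anchored gsub pattern.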

gsub unable to remove empty brackets in R

I have the following string in R
A<-"A (23) 56 hh()"
I want to get the following output
"A (23) 56 hh"
I tried the following code
B<-gsub(pattern = "()", replacement = "", x = A)
That didn't yield the desired result. How can I accomplish this?
Try fixed = TRUE in gsub
> gsub("()", "", A, fixed = TRUE)
[1] "A (23) 56 hh"
Using str_remove
library(stringr)
str_remove_all(A, fixed("()"))
-output
[1] "A (23) 56 hh"
Try B <- gsub(pattern = "\\(\\)", replacement = "", x = A)
The \\ escapes each parenthesis so that it is matched as a literal character rather than interpreted as a regex grouping operator.
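To illustrate why the original attempt did nothing: as a regular expression, "()" is an empty group, which matches the empty string, so there is nothing to remove. A minimal sketch comparing the three variants (variable names are illustrative):

```r
A <- "A (23) 56 hh()"

# Unescaped: "()" is an empty regex group, so the string comes back unchanged.
unescaped <- gsub("()", "", A)

# Escaped parentheses are matched literally.
escaped <- gsub("\\(\\)", "", A)

# With fixed = TRUE the pattern is not interpreted as a regex at all.
literal <- gsub("()", "", A, fixed = TRUE)

print(c(unescaped, escaped, literal))
```

The escaped and fixed versions give the same result; fixed = TRUE is usually easier to read when no regex features are needed.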
dy_by and ThomasIsCoding have good answers. Here is a modification of dy_by's answer
gsub(pattern = "\\()", replacement = "", x = A)
[1] "A (23) 56 hh"
Another option defining removal of two consecutive parenthesis chars, which obviates the need for fixed=TRUE:
library(stringr)
A %>% str_remove("[()]{2}")
[1] "A (23) 56 hh"

Locate position of first number in string [R]

How can I create a function in R that locates the word position of the first number in a string?
For example:
string1 <- "Hello I'd like to extract where the first 1010 is in this string"
#desired_output for string1
9
string2 <- "80111 is in this string"
#desired_output for string2
1
string3 <- "extract where the first 97865 is in this string"
#desired_output for string3
5
I would just use grep and strsplit here for a base R option:
sapply(input, function(x) grep("\\d+", strsplit(x, " ")[[1]]))
Hello I'd like to extract where the first 1010 is in this string
9
80111 is in this string
1
extract where the first 97865 is in this string
5
Data:
input <- c("Hello I'd like to extract where the first 1010 is in this string",
"80111 is in this string",
"extract where the first 97865 is in this string")
Here is a way to return your desired output:
library(stringr)
min(which(!is.na(suppressWarnings(as.numeric(str_split(string, " ", simplify = TRUE))))))
This is how it works:
str_split(string, " ", simplify = TRUE) # converts your string to a vector/matrix, splitting at space
as.numeric(...) # tries to convert each element to a number, returning NA when it fails
suppressWarnings(...) # suppresses the warnings generated by as.numeric
!is.na(...) # returns true for the values that are not NA (i.e. the numbers)
which(...) # returns the position for each TRUE values
min(...) # returns the first position
The output:
min(which(!is.na(suppressWarnings(as.numeric(str_split(string1, " ", simplify = TRUE))))))
[1] 9
min(which(!is.na(suppressWarnings(as.numeric(str_split(string2, " ", simplify = TRUE))))))
[1] 1
min(which(!is.na(suppressWarnings(as.numeric(str_split(string3, " ", simplify = TRUE))))))
[1] 5
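The steps above can also be sketched in base R as a vectorised helper. The function name first_number_position is hypothetical, and returning NA when a string contains no numeric word is an assumption added here, since the one-liner above would warn and return Inf in that case:

```r
# Hypothetical helper: word position of the first purely numeric word, or NA.
first_number_position <- function(strings) {
  vapply(strsplit(strings, " ", fixed = TRUE), function(words) {
    # TRUE for words that convert cleanly to a number.
    is_number <- !is.na(suppressWarnings(as.numeric(words)))
    if (any(is_number)) which(is_number)[1] else NA_integer_
  }, integer(1))
}

strings <- c("Hello I'd like to extract where the first 1010 is in this string",
             "80111 is in this string",
             "extract where the first 97865 is in this string",
             "no digits here")
positions <- first_number_position(strings)
print(positions)
```

Like the original, this treats a word as a number only if as.numeric() can convert the whole word, so something like "abc123" does not count.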
Here I'll leave a fully tidyverse approach:
library(purrr)
library(stringr)
map_dbl(str_split(strings, " "), str_which, "\\d+")
#> [1] 9 1 5
map_dbl(str_split(strings[1], " "), str_which, "\\d+")
#> [1] 9
Note that it works both with one and multiple strings.
Where strings is:
strings <- c("Hello I'd like to extract where the first 1010 is in this string",
"80111 is in this string",
"extract where the first 97865 is in this string")
Here is another approach. We can trim off the remaining characters after the first digit of the first number. Then, just find the position of the last word. \\b matches word boundaries while \\S+ matches one or more non-whitespace characters.
first_numeric_word <- function(x) {
x <- substr(x, 1L, regexpr("\\b\\d+\\b", x))
lengths(gregexpr("\\b\\S+\\b", x))
}
Output
> first_numeric_word(x)
[1] 9 1 5
Data
x <- c(
"Hello I'd like to extract where the first 1010 is in this string",
"80111 is in this string",
"extract where the first 97865 is in this string"
)
Here is a base solution using rapply() with grep() to recurse through the results of strsplit(); it works with a vector of strings.
Note: swap " " and fixed = TRUE with "\\s+" and fixed = FALSE (the default) if you want to split the strings on any whitespace instead of a literal space.
rapply(strsplit(strings, " ", fixed = TRUE), function(x) grep("[0-9]+", x))
[1] 9 1 5
Data:
strings = c("Hello I'd like to extract where the first 1010 is in this string",
"80111 is in this string", "extract where the first 97865 is in this string")
Try the following:
library(stringr)
position_first_number <- function(string) {
min(which(str_detect(str_split(string, "\\s+", simplify = TRUE), "[0-9]+")))
}
With your example strings:
> string1 <- "Hello I'd like to extract where the first 1010 is in this string"
> position_first_number(string1)
[1] 9
> string2 <- "80111 is in this string"
> position_first_number(string2)
[1] 1
> string3 <- "extract where the first 97865 is in this string"
> position_first_number(string3)
[1] 5

R: Delete first and last part of string based on pattern

This string is a ticker for a bond: OAT 3 25/32 7/17/17. I want to extract the coupon rate which is 3 25/32 and is read as 3 + 25/32 or 3.78125. Now I've been trying to delete the date and the name OAT with gsub, however I've encountered some problems.
This is the code to delete the date:
tkr.bond <- 'OAT 3 25/32 7/17/17'
tkr.ptrn <- '[0-9][[:punct:]][0-9][[:punct:]][0-9]'
gsub(tkr.ptrn, "", tkr.bond)
However it gets me the same string. When I use [0-9][[:punct:]][0-9] in the pattern I manage to delete part of the date, however it also deletes the fraction part of the coupon rate for the bond.
The tricky thing is to find a solution that doesn't involve the pattern of the coupon because the tickers have this form: Name Coupon Date, so, using a specific pattern for the coupon may limit the scope of the solution. For example, if the ticker is this way OAT 0 7/17/17, the coupon is zero.
Just replace first and last word with an empty string.
> tkr.bond <- 'OAT 3 25/32 7/17/17'
> gsub("^\\S+\\s*|\\s*\\S+$", "", tkr.bond)
[1] "3 25/32"
OR
Use the gsubfn function (from the gsubfn package) in order to use a function as the replacement.
> gsubfn("^\\S+\\s+(\\d+)\\s+(\\d+)/(\\d+).*", ~ as.numeric(x) + as.numeric(y)/as.numeric(z), tkr.bond)
[1] "3.78125"
Update:
> tkr.bond1 <- c(tkr.bond, 'OAT 0 7/17/17')
> m <- gsub("^\\S+\\s*|\\s*\\S+$", "", tkr.bond1)
> gsubfn(".+", ~ eval(parse(text=x)), gsub("\\s+", "+", m))
[1] "3.78125" "0"
Try
eval(parse(text=sub('[A-Z]+ ([0-9]+ )([0-9/]+) .*', '\\1 + \\2', tkr.bond)))
#[1] 3.78125
Or you may need
sub('^[A-Z]+ ([^A-Z]+) [^ ]+$', '\\1', tkr.bond)
#[1] "3 25/32"
Update
tkr.bond1 <- c(tkr.bond, 'OAT 0 7/17/17')
v1 <- sub('^[A-Z]+ ([^A-Z]+) [^ ]+$', '\\1', tkr.bond1)
unname(sapply(sub(' ', '+', v1), function(x) eval(parse(text=x))))
#[1] 3.78125 0.00000
Or
vapply(strsplit(tkr.bond1, ' '), function(x)
eval(parse(text= paste(x[-c(1, length(x))], collapse="+"))), 0)
#[1] 3.78125 0.00000
Or without eval(parse()):
vapply(strsplit(gsub('^[^ ]+ | [^ ]+$', '', tkr.bond1), '[ /]'), function(x) {
x1 <- as.numeric(x)
sum(x1[1], x1[2]/x1[3], na.rm=TRUE)}, 0)
#[1] 3.78125 0.00000
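The eval-free version can be wrapped into a small reusable function. This is a sketch only: the name coupon_rate is hypothetical, and it assumes every ticker has the form Name Coupon Date with single spaces, where the coupon is a whole number, a fraction, or both:

```r
# Hypothetical helper: numeric coupon from tickers of the form "Name Coupon Date".
coupon_rate <- function(tickers) {
  # Drop the first and the last word, keeping only the coupon part.
  middle <- gsub("^[^ ]+ | [^ ]+$", "", tickers)
  vapply(strsplit(middle, "[ /]"), function(parts) {
    p <- as.numeric(parts)
    if (length(p) == 1) {
      p[1]                  # whole number only, e.g. "0"
    } else if (length(p) == 2) {
      p[1] / p[2]           # fraction only, e.g. "25/32"
    } else {
      p[1] + p[2] / p[3]    # whole number plus fraction, e.g. "3 25/32"
    }
  }, numeric(1))
}

rates <- coupon_rate(c("OAT 3 25/32 7/17/17", "OAT 0 7/17/17"))
print(rates)
```

Avoiding eval(parse()) means the ticker text is never executed as code, which is generally safer when the input comes from an external source.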
Similar to akrun's answer, using sub with a replacement. How it works: you put your "desired" pattern inside parentheses and leave the rest out (while still putting regex characters to match what's there and that you don't wish to keep). Then when you say replacement = "\\1" you indicate that the whole string must be substituted by only what's inside the parentheses.
sub(pattern = ".*\\s(\\d\\s\\d+\\/\\d+)\\s.*", replacement = "\\1", x = tkr.bond, perl = TRUE)
# [1] "3 25/32"
Then you can change it to numerical:
temp <- sub(pattern = ".*\\s(\\d\\s\\d+\\/\\d+)\\s.*", replacement = "\\1", x = tkr.bond, perl = TRUE)
eval(parse(text=sub(" ","+",x = temp)))
# [1] 3.78125
You can also use strsplit here. Then evaluate components excluding the first and the last. Like this
> tickers <- c('OAT 3 25/32 7/17/17', 'OAT 0 7/17/17')
>
> unlist(lapply(lapply(strsplit(tickers, " "),
+ function(x) {x[-length(x)][-1]}),
+ function(y) {sum(
+ sapply(y, function (z) {eval(parse(text = z))}) )} ) )
[1] 3.78125 0.00000
