Merge Multiple spaces to single space; remove trailing/leading spaces - r

I want to merge multiple spaces into a single space (a space could also be a tab) and remove leading/trailing spaces.
For example...
string <- "Hi buddy what's up Bro"
to
"Hi buddy what's up bro"
I checked the solution given at Regex to replace multiple spaces with a single space. Please don't put \t or \n as an exact space inside the toy string and feed that as the pattern in gsub. I want to do this in R.
Note that I am unable to put multiple spaces in the toy string.
Thanks

This seems to meet your needs.
string <- " Hi buddy what's up Bro "
library(stringr)
str_replace(gsub("\\s+", " ", str_trim(string)), "B", "b")
# [1] "Hi buddy what's up bro"

Or simply try the str_squish function from stringr:
library(stringr)
string <- " Hi buddy what's up Bro "
str_squish(string)
# [1] "Hi buddy what's up Bro"

Another approach using a single regex:
gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "", string, perl=TRUE)
Explanation:
NODE      EXPLANATION
--------------------------------------------------------------------------------
(?<=      look behind to see if there is:
  [\s]      any character of: whitespace (\n, \r, \t, \f, and " ")
)         end of look-behind
\s*       whitespace (\n, \r, \t, \f, and " ") (0 or more times (matching the most amount possible))
|         OR
^         the beginning of the string
\s+       whitespace (\n, \r, \t, \f, and " ") (1 or more times (matching the most amount possible))
|         OR
\s+       whitespace (\n, \r, \t, \f, and " ") (1 or more times (matching the most amount possible))
$         before an optional \n, and the end of the string
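One consequence of the look-behind branch is that the first whitespace character of each run is kept as it is, so a run that starts with a tab stays a tab. A minimal sketch comparing it with the two-step base R approach (the input string is made up for illustration):

```r
# Illustrative input: a leading space, a run of two tabs, a run of two
# spaces, and a trailing space.
x <- " Hi\t\tbuddy  what's up "

# Single regex: the look-behind branch deletes everything after the first
# whitespace character of each run; the other two branches trim the ends.
single_regex <- gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "", x, perl = TRUE)

# Two steps: replace every run with a literal space, then trim the ends.
two_step <- trimws(gsub("\\s+", " ", x))

single_regex  # "Hi\tbuddy what's up"  (the tab survives)
two_step      # "Hi buddy what's up"
```

If tabs should end up as plain spaces, the two-step version is probably the safer choice.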

You do not need to import external libraries to perform such a task:
string <- " Hi buddy what's up Bro "
string <- gsub("\\s+", " ", string)
string <- trimws(string)
string
[1] "Hi buddy what's up Bro"
Or, in one line:
string <- trimws(gsub("\\s+", " ", string))
Much cleaner.

The qdapRegex package has the rm_white function to handle this:
library(qdapRegex)
rm_white(string)
## [1] "Hi buddy what's up Bro"

You could also try clean from qdap
library(qdap)
library(stringr)
str_trim(clean(string))
#[1] "Hi buddy what's up Bro"
Or as suggested by @Tyler Rinker (using only qdap)
Trim(clean(string))
#[1] "Hi buddy what's up Bro"

There is no need to load any extra libraries for this, since gsub() from base R does the work.
Remove leading and trailing whitespace with trimws() and replace the extra whitespace using gsub(), as mentioned by @Adam Erickson.
`string = " Hi buddy what's up Bro "
trimws(gsub("\\s+", " ", string))`
Here \\s+ matches one or more whitespace characters, and gsub replaces each match with a single space.
To find out what any regular expression is doing, visit this link, as mentioned by @Tyler Rinker.
Just copy and paste the regular expression you want explained and it will do the rest.
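If this comes up often, the trimws plus gsub combination can be wrapped in a small function. This is only a sketch: squish_ws is a made-up name, not something that ships with base R, and it uses the POSIX class [[:space:]] so that tabs and newlines are collapsed as well.

```r
# squish_ws is a hypothetical helper name (not part of base R).
# It collapses every run of whitespace to one space, then trims both ends.
squish_ws <- function(x) {
  trimws(gsub("[[:space:]]+", " ", x))
}

# Works on a whole character vector at once.
squish_ws(c("  Hi \t buddy\n", "what's   up"))
# [1] "Hi buddy"  "what's up"
```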

Another solution using strsplit:
Split the text into words, then concatenate the individual words using the paste function.
string <- "Hi buddy what's up Bro"
stringsplit <- sapply(strsplit(string, " "), function(x){x[!x ==""]})
paste(stringsplit ,collapse = " ")
For more than one document:
string <- c("Hi buddy what's up Bro"," an example using strsplit ")
stringsplit <- lapply(strsplit(string, " "), function(x){x[!x ==""]})
sapply(stringsplit ,function(d) paste(d,collapse = " "))
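Note that splitting on a literal " " only handles spaces. A variant of the same idea, assuming tabs and newlines should count as separators too, splits on a whitespace regex instead (docs is just an example vector):

```r
# Example input with tabs and repeated spaces (made up for illustration).
docs <- c("Hi \t buddy  what's up", "  an example\tusing strsplit ")

# Split on runs of any whitespace; leading whitespace yields an empty
# first element, which nzchar() filters out.
words <- strsplit(docs, "[[:space:]]+")

# Paste each document's words back together with single spaces.
collapsed <- vapply(words,
                    function(w) paste(w[nzchar(w)], collapse = " "),
                    character(1))
collapsed
# [1] "Hi buddy what's up"        "an example using strsplit"
```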

This seems to work.
It doesn't eliminate whitespace at the beginning or the end of the sentence, as Rich Scriven's answer does,
but it does merge multiple whitespace characters.
library("stringr")
string <- "Hi buddy what's up Bro"
str_replace_all(string, "\\s+", " ")
# [1] "Hi buddy what's up Bro"

Related

Extract word in quotes from string

I am using R to extract words from short text pieces. Specifically, I want to extract any word that appears in quotes (") from a string, but not when it appears inside brackets ().
For instance, I would like "hello" from the first of the 3 strings, but not from the other two:
c('"hello" world', 'hello world', '("hello") world')
Original code attempt
str_extract(x, '(?<=")[^$]+(?<=")')
You may use this regex with nested look arounds in str_extract:
(?<=(?<!\()")[^"]+(?=(?!\))")
RegEx Details:
(?<=(?<!\()"): Assert that we have a " before but don't have a ( before "
[^"]+: Match 1+ of any characters that are not "
(?=(?!\))"): Assert that we have a " after but don't have a ) after "
Code:
str_extract(x, '(?<=(?<!\\()")[^"]+(?=(?!\\))")')
or avoid double escaping by using a character class:
str_extract(x, '(?<=(?<![(])")[^"]+(?=(?![)])")')
We can use a regex lookaround
library(stringr)
ifelse(grepl('\\("', str1), NA, str_extract(str1, '(?<=")\\w+'))
#[1] "hello" NA NA
data
str1 <- c("\"hello\" world", "hello world", "(\"hello\") world")
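For anyone who prefers base R, here is a rough equivalent. It is not the regex from the answers above but a swapped-in variant: it matches the quotes themselves, rejects an opening quote preceded by ( and a closing quote followed by ), and pulls the word out of a capture group. It assumes R 3.3.0 or later, where regexec accepts perl = TRUE.

```r
x <- c('"hello" world', 'hello world', '("hello") world')

# Opening quote not preceded by "(", closing quote not followed by ")".
# The quoted text itself is capture group 1.
pattern <- '(?<![(])"([^"]+)"(?![)])'

hits <- regmatches(x, regexec(pattern, x, perl = TRUE))

# Each hit is c(full match, group 1), or character(0) when nothing matched.
quoted <- vapply(hits,
                 function(g) if (length(g) == 2) g[2] else NA_character_,
                 character(1))
quoted
# [1] "hello" NA      NA
```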

Remove whitespace after a symbol (hyphen) in R

I'm trying to remove a hyphen that splits a word inside a string. For example, the word "example" in "for exam- ple this".
a <- "for exam- ple this"
How could I join them?
I have tried to remove the hyphen using this command:
str_replace_all(a, "-", "")
But I got this back:
"for exam ple this"
It does not return the word joined together. I have also tried this:
str_replace_all(a, "- ", "") but I get nothing.
So I thought of first removing the whitespace after the hyphen to get the following
"for exam-ple this"
and then eliminating the hyphen.
Can you help me?
Here is an option with sub where we match the - followed by zero or more spaces (\\s*) and replace with -
sub("-\\s*", "-", a)
#[1] "for exam-ple this"
If there are several hyphens to fix rather than a single one, use gsub, which replaces every match instead of only the first
gsub("-\\s*", "-", a)
str_replace_all(a, "- ", "-")
If you are just trying to remove the whitespace after a symbol then Ricardo's answer is sufficient. If you want to remove an unknown amount of whitespace after a hyphen consider
str_replace_all(a, "- +", "-")
#[1] "for exam-ple this"
b <- "for exam- ple this"
str_replace_all(b, "- +", "-")
#[1] "for exam-ple this"
EDIT --- Explanation
The "+" is a regular expression quantifier that tells R how to match a string. "+" specifically means to match the preceding character (or group/set) 1 or more times. You can find out more about regular expressions here.
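To make the quantifier difference concrete, here is a small base R sketch. The strings are made up, and it goes one step further than the answers above by dropping the hyphen as well, since the stated goal was to join the word:

```r
a <- "for exam-   ple this"   # hyphen followed by three spaces
b <- "for exam-ple this"      # hyphen with no space after it

# Literal "- " consumes exactly one space, so two are left behind.
gsub("- ", "", a)      # "for exam  ple this"

# "- +" consumes the hyphen and one or more spaces.
gsub("- +", "", a)     # "for example this"

# "- +" needs at least one space, so b is left alone;
# "-\\s*" (zero or more whitespace) joins both forms.
gsub("- +", "", b)     # "for exam-ple this"
gsub("-\\s*", "", b)   # "for example this"
```

Whether removing every hyphen like this is safe depends on the text, since genuinely hyphenated words would be joined too.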

How to throw out spaces and underscores only from the beginning of the string?

I want to ignore the spaces and underscores in the beginning of a string in R.
I can write something like
txt <- gsub("^\\s+", "", txt)
txt <- gsub("^\\_+", "", txt)
But I think there could be an elegant solution
txt <- " 9PM 8-Oct-2014_0.335kwh "
txt <- gsub("^[\\s+|\\_+]", "", txt)
txt
The output should be "9PM 8-Oct-2014_0.335kwh ". But my code gives " 9PM 8-Oct-2014_0.335kwh ".
How can I fix it?
You could put just \s and the underscore in a character class and use a quantifier to repeat that 1+ times.
^[\s_]+
For example:
txt <- gsub("^[\\s_]+", "", txt, perl=TRUE)
Or as @Tim Biegeleisen points out in the comment, since only the first occurrence is being replaced you could use sub instead (keep the ^ anchor so that only the start of the string can match):
txt <- sub("^[\\s_]+", "", txt, perl=TRUE)
Or using a POSIX character class
txt <- sub("^[[:space:]_]+", "", txt)
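The ^ anchor matters as soon as a string does not start with whitespace or an underscore. A quick sketch with two made-up inputs:

```r
txt_a <- " _ 9PM 8-Oct-2014_0.335kwh "   # junk at the start
txt_b <- "9PM 8-Oct-2014_0.335kwh "      # already clean at the start

# Anchored: only a run at the very beginning can match.
sub("^[[:space:]_]+", "", txt_a)   # "9PM 8-Oct-2014_0.335kwh "
sub("^[[:space:]_]+", "", txt_b)   # "9PM 8-Oct-2014_0.335kwh " (unchanged)

# Unanchored: sub falls through to the first run anywhere in the string.
sub("[[:space:]_]+", "", txt_b)    # "9PM8-Oct-2014_0.335kwh "
```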
More info about perl=TRUE and regular expressions used in R
The stringr package offers some task-specific functions with helpful names. In your original question you say you would like to remove whitespace and underscores from the start of your string, but in a comment you imply that you also wish to remove the same characters from the end of the same string. To that end, I'll include a few different options.
Given string s <- " \t_blah_ ", which contains whitespace (spaces and tabs) and underscores:
library(stringr)
# Remove whitespace and underscores at the start.
str_remove(s, "^[\\s_]+")
# [1] "blah_ "
# Remove whitespace and underscores at the start and end.
str_remove_all(s, "^[\\s_]+|[\\s_]+$")
# [1] "blah"
In case you're looking to remove whitespace only – there are, after all, no underscores at the start or end of your example string – there are a couple of stringr functions that will help you keep things simple:
# `str_trim` trims whitespace (spaces, tabs, newlines) from either or both sides.
str_trim(s, side = "left")
# [1] "_blah_ "
str_trim(s, side = "right")
# [1] " \t_blah_"
str_trim(s, side = "both") # This is the default.
# [1] "_blah_"
# `str_squish` reduces repeated whitespace anywhere in string.
s <- " \t_blah blah_ "
str_squish(s)
# "_blah blah_"
The same pattern [\\s_]+ will also work in base R's sub or gsub, with some minor modifications, if that's your jam (see The fourth bird's answer).
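As a rough idea of what those base R versions could look like, here is a sketch that uses the POSIX class [[:space:]] so that perl = TRUE is not needed. The string is made up, with an underscore in the middle to show what each pattern leaves alone:

```r
s <- " \t_blah_blah_ "

# Strip whitespace and underscores from both ends only.
gsub("^[[:space:]_]+|[[:space:]_]+$", "", s)
# [1] "blah_blah"

# Without the anchors, every run is removed, including the inner underscore.
gsub("[[:space:]_]+", "", s)
# [1] "blahblah"
```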
You can use stringr as:
txt <- " 9PM 8-Oct-2014_0.335kwh "
library(stringr)
str_trim(txt)
[1] "9PM 8-Oct-2014_0.335kwh"
Or the trimws in Base R
trimws(txt)
[1] "9PM 8-Oct-2014_0.335kwh"

regex misunderstanding in r

I don't seem to understand gsub or stringr.
Example:
> a<- "a book"
> gsub(" ", ".", a)
[1] "a.book"
Okay. BUT:
> a<-"a.book"
> gsub(".", " ", a)
[1] " "
I would have expected
"a book"
I'm replacing the full stop with a space.
Also, with stringr: str_replace(a, ".", " ") returns:
" .book"
and str_replace_all(a, ".", " ") returns
" "
I can use stringi: stri_replace(a, " ", fixed="."):
"a book"
I'm just wondering why gsub (and str_replace) don't act as I'd have expected. They work when replacing a space with another character, but not the other way around.
That's because the first argument to gsub, namely pattern is actually a regex. In regex the period . is a metacharacter and it matches any single character, see ?base::regex. In your case you need to escape the period in the following way:
gsub("\\.", " ", a)
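For comparison, here are the usual ways of making the period literal side by side, as a short base R sketch:

```r
a <- "a.book"

# Unescaped: "." matches every character, so all six are replaced.
gsub(".", " ", a)                 # "      "

# Three equivalent ways to match a literal period:
gsub("\\.", " ", a)               # escape it
gsub("[.]", " ", a)               # put it in a character class
gsub(".", " ", a, fixed = TRUE)   # turn off regex matching altogether
# each returns "a book"
```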

Removing punctuation between two words

I have a data frame (df) and I would like to remove punctuation.
However, there is an issue with a dot between two words and at the end of a word, like this:
test.
test1.test2
I use this to remove the punctuation:
library(tm)
removePunctuation(df)
and the result I get is this:
test
test1test2
but I would like to get this as the result:
test
test1 test2
How can I get a space between the two words when removing the punctuation?
You can use chartr for single character substitution:
chartr(".", " ", c("test1.test2"))
# [1] "test1 test2"
@akrun suggested trimws to remove the space at the end of your test string:
str <- c("test.", "test1.test2")
trimws(chartr(".", " ", str))
# [1] "test" "test1 test2"
We can use gsub to replace the . with a white space and remove the trailing/leading spaces (if any) with trimws.
trimws(gsub('[.]', ' ', str1))
#[1] "test" "test1 test2"
NOTE: In regex, . by itself means any character. So we should either keep it inside square brackets ([.]), escape it (\\.), or use the option fixed=TRUE
trimws(gsub('.', ' ', str1, fixed=TRUE))
data
str1 <- c("test.", "test1.test2")
You can also use strsplit:
a <- "test."
b <- "test1.test2"
do.call(paste, as.list(strsplit(a, "\\.")[[1]]))
[1] "test"
do.call(paste, as.list(strsplit(b, "\\.")[[1]]))
[1] "test1 test2"
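If the goal is all punctuation rather than just dots, the same gsub plus trimws idea extends with the POSIX class [[:punct:]]. This is a sketch on a plain character vector; the third element is an added example, and applying it to a data frame column is left as an assumption about how df is laid out:

```r
words <- c("test.", "test1.test2", "a,b;c")

# Replace each run of punctuation with one space, then trim the ends.
cleaned <- trimws(gsub("[[:punct:]]+", " ", words))
cleaned
# [1] "test"        "test1 test2" "a b c"
```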

Resources