stringr - remove multiple spaces, but keep linebreaks (\n, \r) - r

I am working on some raw text and want to replace all runs of multiple spaces with a single space. Normally, I would use stringr's str_squish, but unfortunately it also removes line breaks (\n and \r), which I have to keep.
Any idea? Below my attempts. Many thanks!
library(tidyverse)
x <- "hello \n\r how are you \n\r all good?"
str_squish(x)
#> [1] "hello how are you all good?"
str_replace_all(x, "[:space:]+", " ")
#> [1] "hello how are you all good?"
str_replace_all(x, "\\s+", " ")
#> [1] "hello how are you all good?"
Created on 2020-07-01 by the reprex package (v0.3.0)

With stringr, you may use the \h shorthand character class to match any horizontal whitespace.
library(stringr)
x <- "hello \n\r how are you \n\r all good?"
x <- str_replace_all(x, "\\h+", " ")
## [1] "hello \n\r how are you \n\r all good?"
In base R, you may use it, too, with a PCRE pattern:
gsub("\\h+", " ", x, perl=TRUE)
See the online R demo.
If you still want to match any whitespace (including some Unicode line breaks) other than the CR and LF characters, you may simply use the [^\S\r\n] pattern:
str_replace_all(x, "[^\\S\r\n]+", " ")
gsub("[^\\S\r\n]+", " ", x, perl=TRUE)
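As a quick check of the two patterns above, here is a minimal base R sketch. The input string is made up for illustration, with the runs of spaces written out explicitly; it suggests that both patterns collapse horizontal whitespace while leaving \n and \r untouched.

```r
# Illustrative input: runs of spaces around LF/CR pairs
x <- "hello  \n\r  how   are you \n\r all   good?"

# \h matches horizontal whitespace only (needs perl = TRUE in base R)
by_h <- gsub("\\h+", " ", x, perl = TRUE)

# Negated class: any whitespace character except CR and LF
by_negated <- gsub("[^\\S\r\n]+", " ", x, perl = TRUE)

print(by_h)
print(identical(by_h, by_negated))
```

For this input the two results are the same; they would differ only if the text contained vertical whitespace other than CR and LF, such as a form feed.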

You could just use a literal space in the regex instead of \\s or [:space:]:
str_replace_all(x, " +", " ") %>%
cat()
hello
how are you
all good?
You can also include tabs by using [ \t], [:blank:], or \\h instead of the literal space. In this case, you may want to use {2,} to match two or more of the same selector so you don't have to write the pattern twice (i.e. [:blank:][:blank:]+):
y <- "hello \n\r\t\thow are you \n\r all good?"
str_replace_all(y, "[:blank:]{2,}", " ") %>%
cat()
hello
how are you
all good?
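One caveat with {2,} is that a lone tab is left as it is, whereas + would also turn it into a space. A small base R sketch with a made-up string (not from the question) shows the difference:

```r
# Illustrative input: a single tab, then a mixed run, then two spaces
y <- "a\tb \t\tc  d"

# {2,}: only runs of two or more blanks are replaced
two_or_more <- gsub("[ \t]{2,}", " ", y)

# +: every run of blanks, including a single tab, becomes one space
one_or_more <- gsub("[ \t]+", " ", y)

print(two_or_more)
print(one_or_more)
```

Which one fits depends on whether single tabs should survive in the cleaned text.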

Related

How to remove n number of identical characters from string in R

I have a string in R where the words are interspaced with a random number of character \n:
mystring = c("hello\n\ni\n\n\n\n\nam\na\n\n\n\n\n\n\ndog")
I want to replace n number of repeating \n elements so that there is only a space character between words. I can currently do this as follows, but I want a tidier solution:
mystring %>%
gsub("\n\n", "\n", .) %>%
gsub("\n\n", "\n", .) %>%
gsub("\n\n", "\n", .) %>%
gsub("\n", " ", .)
[1] "hello i am a dog"
What is the best way to achieve this?
We can use + to signify one or more repetitions
gsub("\n+", " ", mystring)
[1] "hello i am a dog"
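To see what the quantifier changes, this short sketch compares replacing each \n individually with replacing each run of \n. The variable names each_newline and each_run are only illustrative.

```r
mystring <- "hello\n\ni\n\n\n\n\nam\na\n\n\n\n\n\n\ndog"

# Without +, every newline becomes its own space, so runs of spaces remain
each_newline <- gsub("\n", " ", mystring)

# With +, a whole run of newlines is replaced by a single space
each_run <- gsub("\n+", " ", mystring)

print(nchar(each_newline))
print(each_run)
```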
We could use the same logic as akrun with str_replace_all:
library(stringr)
str_replace_all(mystring, '\n+', ' ')
[1] "hello i am a dog"
In this case, you might find str_squish() convenient. This is intended to solve this exact problem, while the other solutions show good ways to solve the more general case.
library(stringr)
mystring = c("hello\n\ni\n\n\n\n\nam\na\n\n\n\n\n\n\ndog")
str_squish(mystring)
# [1] "hello i am a dog"
If you look at the code of str_squish(), it is basically a wrapper around str_replace_all().
str_squish
function (string)
{
stri_trim_both(str_replace_all(string, "\\s+", " "))
}
Another possible solution, based on stringr::str_squish:
library(stringr)
str_squish(mystring)
#> [1] "hello i am a dog"

How to throw out spaces and underscores only from the beginning of the string?

I want to ignore the spaces and underscores in the beginning of a string in R.
I can write something like
txt <- gsub("^\\s+", "", txt)
txt <- gsub("^\\_+", "", txt)
But I think there could be an elegant solution
txt <- " 9PM 8-Oct-2014_0.335kwh "
txt <- gsub("^[\\s+|\\_+]", "", txt)
txt
The output should be "9PM 8-Oct-2014_0.335kwh ". But my code gives " 9PM 8-Oct-2014_0.335kwh ".
How can I fix it?
You could put \s and the underscore together in a single character class and use a quantifier to repeat it one or more times.
^[\s_]+
Regex demo
For example:
txt <- gsub("^[\\s_]+", "", txt, perl=TRUE)
Or, as @Tim Biegeleisen points out in the comments, since only the first occurrence is being replaced you could use sub instead (keep the ^ anchor so that only a leading run is removed):
txt <- sub("^[\\s_]+", "", txt, perl=TRUE)
Or using a POSIX character class
txt <- sub("^[[:space:]_]+", "", txt)
More info about perl=TRUE and regular expressions used in R
R demo
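The following sketch checks the anchored pattern in both its PCRE and POSIX forms, and shows what can happen when the ^ is left out. The input strings are made up for illustration.

```r
# Illustrative input with leading spaces and underscores
txt <- " _ 9PM 8-Oct-2014_0.335kwh "

# Anchored: only the leading run is removed, the inner underscore survives
anchored_pcre  <- sub("^[\\s_]+", "", txt, perl = TRUE)
anchored_posix <- sub("^[[:space:]_]+", "", txt)

# Unanchored: on a string with no leading run, the first inner match is removed
unanchored <- sub("[\\s_]+", "", "9PM 8-Oct", perl = TRUE)

print(anchored_pcre)
print(unanchored)
```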
The stringr package offers some task-specific functions with helpful names. In your original question you say you would like to remove whitespace and underscores from the start of your string, but in a comment you imply that you also wish to remove the same characters from the end of the same string. To that end, I'll include a few different options.
Given string s <- " \t_blah_ ", which contains whitespace (spaces and tabs) and underscores:
library(stringr)
# Remove whitespace and underscores at the start only.
str_remove(s, "^[\\s_]+")
# [1] "blah_ "
# Remove whitespace and underscores at the start and end.
str_remove_all(s, "^[\\s_]+|[\\s_]+$")
# [1] "blah"
In case you're looking to remove whitespace only – there are, after all, no underscores at the start or end of your example string – there are a couple of stringr functions that will help you keep things simple:
# `str_trim` trims whitespace (\s and \t) from either or both sides.
str_trim(s, side = "left")
# [1] "_blah_ "
str_trim(s, side = "right")
# [1] " \t_blah_"
str_trim(s, side = "both") # This is the default.
# [1] "_blah_"
# `str_squish` reduces repeated whitespace anywhere in string.
s <- " \t_blah blah_ "
str_squish(s)
# "_blah blah_"
The same pattern [\\s_]+ will also work in base R's sub or gsub, with some minor modifications, if that's your jam (see The fourth bird's answer).
You can use stringr as:
txt <- " 9PM 8-Oct-2014_0.335kwh "
library(stringr)
str_trim(txt)
[1] "9PM 8-Oct-2014_0.335kwh"
Or trimws in base R
trimws(txt)
[1] "9PM 8-Oct-2014_0.335kwh"
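If underscores should be trimmed as well, base R's trimws() also accepts a whitespace argument. This sketch assumes R 3.6.0 or later, where that argument is available, and reuses the " \t_blah_ " test string from the stringr answer above.

```r
s <- " \t_blah_ "

# Treat underscores as trimmable characters alongside whitespace
left_only  <- trimws(s, which = "left", whitespace = "[\\s_]")
both_sides <- trimws(s, which = "both", whitespace = "[\\s_]")

print(left_only)
print(both_sides)
```

Underscores in the middle of the string are not affected, since trimws() only looks at the two ends.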

Replacing white space with one single backslash

I want to replace a white space with ONE backslash and a whitespace like this:
"foo bar" --> "foo\ bar"
I found how to replace with multiple backslashes but wasn't able to adapt it to a single backslash.
I tried this so far:
x <- "foo bar"
gsub(" ", "\\ ", x)
# [1] "foo bar"
gsub(" ", "\\\ ", x)
# [1] "foo bar"
gsub(" ", "\\\\ ", x)
# [1] "foo\\ bar"
However, all the outcomes do not satisfy my needs. I need the replacement to dynamically create file paths which contain folders with names like
/some/path/foo bar/foobar.txt.
To use them for shell commands in system(), white spaces have to be escaped with a \, giving
/some/path/foo\ bar/foobar.txt.
Do you know how to solve this one?
Your problem is a confusion between the content of a string and its representation. When you print out a string in the ordinary way in R you will never see a single backslash (unless it's denoting a special character, e.g. print("y\n")). If you use cat() instead, you'll see only a single backslash.
x <- "foo bar"
y <- gsub(" ", "\\\\ ", x)
print(y)
## [1] "foo\\ bar"
cat(y,"\n") ## string followed by a newline
## foo\ bar
There are 8 characters in the string; 6 letters, one space, and the backslash.
nchar(y) ## 8
For comparison, consider \n (newline character).
z <- gsub(" ", "\n ", x)
print(z)
## [1] "foo\n bar"
cat(z,"\n")
## foo
## bar
nchar(z) ## 8
If you're constructing file paths, it might be easier to use forward slashes instead - forward slashes work as file separators in R on all operating systems (even Windows). Or check out file.path(). (Without knowing exactly what you're trying to do, I can't say more.)
To replace a space with one backslash and a space, you do not even need a regular expression: use your first attempt, gsub(" ", "\\ ", x), with fixed=TRUE:
> x <- "foo bar"
> res <- gsub(" ", "\\ ", x, fixed=TRUE)
> cat(res, "\n")
foo\ bar
See an online R demo
The cat function displays the "real", literal backslashes.
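A short sketch tying the two answers together: the regex replacement and the fixed=TRUE replacement produce the same 8-character string. Since the stated goal is to pass paths to system(), the sketch also shows base R's shQuote() as a possible alternative to escaping by hand; whether it fits depends on the shell in use, so treat that part as an assumption rather than a recommendation.

```r
x <- "foo bar"

# Regex replacement: "\\\\ " in R source is backslash-backslash-space,
# which the regex engine reads as one literal backslash plus a space
via_regex <- gsub(" ", "\\\\ ", x)

# Fixed replacement: "\\ " in R source is a single backslash plus a space
via_fixed <- gsub(" ", "\\ ", x, fixed = TRUE)

print(identical(via_regex, via_fixed))
print(nchar(via_regex))
writeLines(via_regex)

# Alternative for shell commands: quote the whole path instead of escaping
path <- "/some/path/foo bar/foobar.txt"
quoted <- shQuote(path, type = "sh")
writeLines(quoted)
```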

Removing punctuation between two words

I have a data frame (df) and I would like to remove punctuation.
However, there is an issue with a dot between two words and at the end of a word, like this:
test.
test1.test2
I use this to remove the punctuation:
library(tm)
removePunctuation(df)
and the result I get is this:
test
test1test2
but I would like to get this result:
test
test1 test2
How is it possible to have a space between two words in the removing process?
You can use chartr for single character substitution:
chartr(".", " ", c("test1.test2"))
# [1] "test1 test2"
@akrun suggested trimws to remove the space at the end of your test string:
str <- c("test.", "test1.test2")
trimws(chartr(".", " ", str))
# [1] "test" "test1 test2"
We can use gsub to replace the . with a white space and remove the trailing/leading spaces (if any) with trimws.
trimws(gsub('[.]', ' ', str1))
#[1] "test" "test1 test2"
NOTE: In regex, . by itself means any character. So we should either put it inside square brackets ([.]), escape it (\\.), or use the option fixed=TRUE
trimws(gsub('.', ' ', str1, fixed=TRUE))
data
str1 <- c("test.", "test1.test2")
You can also use strsplit:
a <- "test."
b <- "test1.test2"
do.call(paste, as.list(strsplit(a, "\\.")[[1]]))
[1] "test"
do.call(paste, as.list(strsplit(b, "\\.")[[1]]))
[1] "test1 test2"
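The answers above handle the dot only. Assuming other punctuation marks should be turned into spaces in the same way (an assumption, since the question only shows dots), a sketch with the POSIX [[:punct:]] class could look like this. The third element of str1 is added purely for illustration.

```r
# The first two elements come from the question; the third is illustrative
str1 <- c("test.", "test1.test2", "a,b;c")

# Replace each run of punctuation with one space, then trim the ends
cleaned <- trimws(gsub("[[:punct:]]+", " ", str1))

print(cleaned)
```

For a data frame, the same expression could be applied to a character column, e.g. a hypothetical df$text.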

Merge Multiple spaces to single space; remove trailing/leading spaces

I want to merge multiple spaces into a single space (a space could also be a tab) and remove trailing/leading spaces.
For example...
string <- "Hi buddy what's up Bro"
to
"Hi buddy what's up bro"
I checked the solution given at Regex to replace multiple spaces with a single space. Note: do not put \t or \n in place of the exact spaces inside the toy string and feed that as the pattern to gsub. I want a solution in R.
Note that I am unable to put multiple space in toy string.
Thanks
This seems to meet your needs.
string <- " Hi buddy what's up Bro "
library(stringr)
str_replace(gsub("\\s+", " ", str_trim(string)), "B", "b")
# [1] "Hi buddy what's up bro"
Or simply try the str_squish function from stringr
library(stringr)
string <- " Hi buddy what's up Bro "
str_squish(string)
# [1] "Hi buddy what's up Bro"
Another approach using a single regex:
gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "", string, perl=TRUE)
Explanation:
NODE        EXPLANATION
(?<=        look behind to see if there is:
  [\s]      any character of: whitespace (\n, \r, \t, \f, and " ")
)           end of look-behind
\s*         whitespace (\n, \r, \t, \f, and " ") (0 or more times (matching the most amount possible))
|           OR
^           the beginning of the string
\s+         whitespace (\n, \r, \t, \f, and " ") (1 or more times (matching the most amount possible))
$           before an optional \n, and the end of the string
You do not need to import external libraries to perform such a task:
string <- " Hi buddy what's up Bro "
string <- gsub("\\s+", " ", string)
string <- trimws(string)
string
[1] "Hi buddy what's up Bro"
Or, in one line:
string <- trimws(gsub("\\s+", " ", string))
Much cleaner.
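If this is needed in several places, the one-liner can be wrapped in a small helper. The name squish_ws is hypothetical and the inputs are made up for illustration; the sketch also shows that the approach is vectorised and handles tabs and newlines as well as spaces.

```r
# Hypothetical helper: collapse whitespace runs, then trim both ends
squish_ws <- function(x) trimws(gsub("\\s+", " ", x))

strings <- c("  Hi buddy   what's up\t\tBro  ", "one\n\ntwo")
squished <- squish_ws(strings)

print(squished)
```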
The qdapRegex package has the rm_white function to handle this:
library(qdapRegex)
rm_white(string)
## [1] "Hi buddy what's up Bro"
You could also try clean from qdap
library(qdap)
library(stringr)
str_trim(clean(string))
#[1] "Hi buddy what's up Bro"
Or as suggested by @Tyler Rinker (using only qdap)
Trim(clean(string))
#[1] "Hi buddy what's up Bro"
There is no need to load any extra libraries for this, as gsub() from base R does the job.
Remove leading and trailing whitespace with trimws() and collapse the extra whitespace with gsub(), as mentioned by @Adam Erickson.
string = " Hi buddy what's up Bro "
trimws(gsub("\\s+", " ", string))
Here \\s+ matches one or more whitespace characters, and gsub replaces each run with a single space.
To find out what any regular expression is doing, visit this link, as mentioned by @Tyler Rinker.
Just paste in the regular expression you want explained and it will do the rest.
Another solution using strsplit:
Splitting text into words, and, then, concatenating single words using paste function.
string <- "Hi buddy what's up Bro"
stringsplit <- sapply(strsplit(string, " "), function(x){x[!x ==""]})
paste(stringsplit ,collapse = " ")
For more than one document:
string <- c("Hi buddy what's up Bro"," an example using strsplit ")
stringsplit <- lapply(strsplit(string, " "), function(x){x[!x ==""]})
sapply(stringsplit ,function(d) paste(d,collapse = " "))
This seems to work.
It doesn't remove whitespace at the beginning or the end of the sentence, as Rich Scriven's answer does, but it merges multiple whitespace characters.
library("stringr")
string <- "Hi buddy what's up Bro"
str_replace_all(string, "\\s+", " ")
# [1] "Hi buddy what's up Bro"
