regex misunderstanding in r - r

I don't seem to understand gsub or stringr.
Example:
> a<- "a book"
> gsub(" ", ".", a)
[1] "a.book"
Okay. BUT:
> a<-"a.book"
> gsub(".", " ", a)
[1] " "
I would have expected
"a book"
I'm replacing the full stop with a space.
Also, with stringr: str_replace(a, ".", " ") returns:
" .book"
and str_replace_all(a, ".", " ") returns
" "
I can use stringi: stri_replace(a, " ", fixed="."):
"a book"
I'm just wondering why gsub (and str_replace) don't act as I'd have expected. They work when replacing a space with another character, but not the other way around.

That's because the first argument to gsub, namely pattern, is a regular expression. In a regex the period . is a metacharacter that matches any single character; see ?base::regex. In your case you need to escape the period:
gsub("\\.", " ", a)
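A minimal sketch of the difference, using base R only. The variable names are illustrative; the three patterns are equivalent ways to match a literal period.

```r
x <- "a.book"

# Unescaped: "." is a regex metacharacter, so every character is replaced.
all_replaced <- gsub(".", " ", x)
nchar(all_replaced)  # 6: one space per original character

# Three ways to match a literal period instead:
escaped   <- gsub("\\.", " ", x)              # backslash-escaped
bracketed <- gsub("[.]", " ", x)              # inside a character class
literal   <- gsub(".", " ", x, fixed = TRUE)  # no regex interpretation at all

print(c(escaped, bracketed, literal))
# [1] "a book" "a book" "a book"
```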

Related

Why does gsub/sub not work to replace ".."?

When I call rownames on my df I get something like this:
"Saint.Petersburg..Russia" "Istanbul..Turkey"
This is what I coded
gsub("..", " ", rownames(df))
This is what was returned
[1] "            " "        "
What I expected was
"Saint.Petersburg Russia" "Istanbul Turkey"
Does anyone know what is going wrong here?
In the default regex mode, . matches any character unless it is escaped (\\.) or placed inside square brackets ([.]). The faster option is fixed = TRUE, which treats the pattern as a literal string:
gsub("..", " ", rownames(df), fixed = TRUE)
#[1] "Saint.Petersburg Russia" "Istanbul Turkey"
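To make the comparison concrete, here is a small sketch. The vector rn is a hypothetical stand-in for rownames(df) from the question; the escaped and bracketed patterns are regex equivalents of the fixed = TRUE call.

```r
# Hypothetical stand-in for rownames(df)
rn <- c("Saint.Petersburg..Russia", "Istanbul..Turkey")

# Regex mode: ".." means "any two characters", so each pair becomes one space.
blanked <- gsub("..", " ", rn)
nchar(blanked)
# [1] 12  8

# Literal matching, plus two regex equivalents
by_fixed   <- gsub("..", " ", rn, fixed = TRUE)
by_escape  <- gsub("\\.\\.", " ", rn)
by_bracket <- gsub("[.]{2}", " ", rn)

print(by_fixed)
# [1] "Saint.Petersburg Russia" "Istanbul Turkey"
```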

what is going on with my trimws?

I was fiddling around in text cleaning when I ran into an interesting occurrence.
Reproducible Code:
trimws(list(c("this is an outrante", " hahaha", " ")))
Output:
[1] "c(\"this is an outrante\", \" hahaha\", \" \")"
I've checked the trimws documentation and it doesn't go into any specifics beyond the fact that it expects a character vector; in my case, I've supplied it with a list containing a character vector. I know I can use lapply to easily solve this, but what I want to understand is what is going on with my trimws call as is.
trimws is meant to be applied directly to a character vector, not to a list.
According to ?trimws documentation, the usage is
trimws(x, which = c("both", "left", "right"))
where
x- a character vector
It is not clear why the vector is wrapped in a list
trimws(c("this is an outrante", " hahaha", " "))
If it really needs to be in a list, then use one of the functions that iterate over the list elements and apply trimws to each
lapply(list(c("this is an outrante", " hahaha", " ")), trimws)
Also, note that the OP's list is a list of length 1, which can be converted back to a vector either by [[1]] or unlist (more general)
trimws(list(c("this is an outrante", " hahaha", " "))[[1]])
As for why the function behaves this way: it expects a vector as input, so the list is coerced with as.character, which deparses each list element into a single string. The behavior is similar for other functions that expect a vector, e.g.
paste(list(c("this is an outrante", " hahaha", " ")))
as.character(list(c("this is an outrante", " hahaha", " ")))
If we check the source of trimws, it calls the regex function sub, which requires a vector
mysub <- function(re, x) sub(re, "", x, perl = TRUE)
mysub("^[ \t\r\n]+", list(c("this is an outrante", " hahaha", " ")))
#[1] "c(\"this is an outrante\", \" hahaha\", \" \")"
Pass it a vector
mysub("^[ \t\r\n]+", c("this is an outrante", " hahaha", " "))
#[1] "this is an outrante" "hahaha" ""
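The coercion step can be inspected directly. This is a sketch in base R; the variable names are illustrative only.

```r
x <- list(c("this is an outrante", " hahaha", " "))

# A list is coerced with as.character(), which deparses each element into
# one string, so sub()/trimws() see a single string that starts with "c(".
coerced <- as.character(x)
length(coerced)
# [1] 1

# Applying trimws() per element keeps the list structure ...
per_element <- lapply(x, trimws)

# ... while unlist() flattens to a character vector first.
flattened <- trimws(unlist(x))
print(flattened)
# [1] "this is an outrante" "hahaha"              ""
```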

substitute word separators with space

I just want to replace some word separators with a space. Any hints on this? Doesn't work after converting to character either.
df <- data.frame(m = 1:3, n = c("one.one", "one.two", "one.three"))
> gsub(".", "\\1 \\2", df$n)
[1] " " " " " "
> gsub(".", " ", df$n)
[1] " " " " " "
You don't need to use regex for one-to-one character translation. You can use chartr().
df$n <- chartr(".", " ", df$n)
df
#   m         n
# 1 1   one one
# 2 2   one two
# 3 3 one three
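chartr() also handles several separators at once, since it maps the i-th character of its first argument to the i-th character of its second. A small sketch with made-up input:

```r
x <- c("one.two-three_four", "a.b")

# Both arguments are treated literally, so "." needs no escaping.
# `new` must be at least as long as `old`.
translated <- chartr(".-_", "   ", x)
print(translated)
# [1] "one two three four" "a b"
```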
You can try
gsub("[.]", " ", df$n)
#[1] "one one" "one two" "one three"
Set fixed = TRUE if you are looking for an exact match and don't need a regular expression.
gsub(".", " ", df$n, fixed = TRUE)
#[1] "one one" "one two" "one three"
That's also faster than using an appropriate regex for such a case.
I suggest doing it like this:
gsub("\\.", " ", df$n)
OR
gsub("\\W", " ", df$n)
\\W matches any non-word character. \\W+ matches one or more non-word characters. Use \\W+ if necessary.
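The difference between \\W and \\W+ shows up when separators come in runs. A short sketch with made-up input:

```r
x <- c("one.two", "one, two", "one -- two")

# \\W replaces each non-word character individually ...
each <- gsub("\\W", " ", x)
nchar(each)
# [1]  7  8 10

# ... while \\W+ collapses a whole run of them into a single space.
runs <- gsub("\\W+", " ", x)
print(runs)
# [1] "one two" "one two" "one two"
```

Note that the underscore counts as a word character, so neither pattern replaces it.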

Merge Multiple spaces to single space; remove trailing/leading spaces

I want to merge multiple spaces into a single space (a space could also be a tab) and remove leading/trailing spaces.
For example...
string <- "Hi buddy what's up Bro"
to
"Hi buddy what's up bro"
I checked the solution given at "Regex to replace multiple spaces with a single space". Please don't put \t or \n as literal whitespace inside the toy string and feed that to gsub as the pattern. I want a solution in R.
Note that I am unable to put multiple spaces in the toy string.
Thanks
This seems to meet your needs.
string <- " Hi buddy what's up Bro "
library(stringr)
str_replace(gsub("\\s+", " ", str_trim(string)), "B", "b")
# [1] "Hi buddy what's up bro"
Or simply try the squish function from stringr
library(stringr)
string <- " Hi buddy what's up Bro "
str_squish(string)
# [1] "Hi buddy what's up Bro"
Another approach using a single regex:
gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "", string, perl=TRUE)
Explanation:
NODE      EXPLANATION
--------------------------------------------------------------------------------
(?<=      look behind to see if there is:
[\s]      any character of: whitespace (\n, \r, \t, \f, and " ")
)         end of look-behind
\s*       whitespace (\n, \r, \t, \f, and " ") (0 or more times, matching the most amount possible)
|         OR
^         the beginning of the string
\s+       whitespace (\n, \r, \t, \f, and " ") (1 or more times, matching the most amount possible)
|         OR
\s+       whitespace (\n, \r, \t, \f, and " ") (1 or more times, matching the most amount possible)
$         before an optional \n, and the end of the string
You do not need to import external libraries to perform such a task:
string <- " Hi buddy what's up Bro "
string <- gsub("\\s+", " ", string)
string <- trimws(string)
string
[1] "Hi buddy what's up Bro"
Or, in one line:
string <- trimws(gsub("\\s+", " ", string))
Much cleaner.
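Since the question mentions that a "space" may also be a tab, here is a sketch that wraps the one-liner in a helper and feeds it mixed whitespace. The name squish is hypothetical, not a base R function.

```r
# Hypothetical helper built from base R only
squish <- function(x) trimws(gsub("\\s+", " ", x))

# \\s+ matches runs of spaces, tabs and newlines alike
cleaned <- squish(" Hi \t buddy\n\nwhat's   up  Bro ")
print(cleaned)
# [1] "Hi buddy what's up Bro"
```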
The qdapRegex package has the rm_white function to handle this:
library(qdapRegex)
rm_white(string)
## [1] "Hi buddy what's up Bro"
You could also try clean from qdap
library(qdap)
library(stringr)
str_trim(clean(string))
#[1] "Hi buddy what's up Bro"
Or as suggested by @Tyler Rinker (using only qdap)
Trim(clean(string))
#[1] "Hi buddy what's up Bro"
No extra libraries are needed for this, since gsub() and trimws() from base R do the job.
Remove leading and trailing whitespace with trimws() and collapse the extra whitespace with gsub(), as mentioned by @Adam Erickson.
string <- " Hi buddy what's up Bro "
trimws(gsub("\\s+", " ", string))
Here \\s+ matches one or more whitespace characters and gsub replaces each run with a single space.
To find out what a regular expression is doing, visit this link, as mentioned by @Tyler Rinker.
Just copy and paste the regular expression and it will explain the rest.
Another solution using strsplit:
Split the text into words, then concatenate the words again with the paste function.
string <- "Hi buddy what's up Bro"
stringsplit <- sapply(strsplit(string, " "), function(x){x[!x ==""]})
paste(stringsplit ,collapse = " ")
For more than one document:
string <- c("Hi buddy what's up Bro"," an example using strsplit ")
stringsplit <- lapply(strsplit(string, " "), function(x){x[!x ==""]})
sapply(stringsplit ,function(d) paste(d,collapse = " "))
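A variant of the same idea, sketched under the assumption that tabs and newlines should count as separators too: splitting on the regex \\s+ instead of a literal space avoids the empty pieces, and trimws() removes the leading run that would otherwise produce an empty first element. The helper name squish_split is hypothetical.

```r
# Hypothetical helper: split on whitespace runs, then re-join with one space
squish_split <- function(x) {
  pieces <- strsplit(trimws(x), "\\s+")
  vapply(pieces, paste, character(1), collapse = " ")
}

docs <- c("Hi   buddy \t what's up Bro", "  an   example using strsplit  ")
rejoined <- squish_split(docs)
print(rejoined)
# [1] "Hi buddy what's up Bro"    "an example using strsplit"
```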
This seems to work.
Unlike Rich Scriven's answer, it doesn't remove whitespace at the beginning or the end of the sentence,
but it does merge multiple whitespace characters.
library("stringr")
string <- "Hi buddy what's up Bro"
str_replace_all(string, "\\s+", " ")
# [1] "Hi buddy what's up Bro"

add space in string when meeting a given pattern

I have a string as follows:
a<-c("AbcDef(123)")
> a
[1] "AbcDef(123)"
Is there any efficient way to transform it as
a<-c("Abc Def (123)")
In other words, I would like to add a space before each uppercase letter or the special character ( .
One possibility:
gsub("(?<=[^A-Z(])(?=[A-Z(])", " ", a, perl = TRUE)
Mine's a bit kludgy and uses two gsubs. The inner gsub adds spaces, the outer gsub removes the leading whitespace.
a <- "AbcDef(123)"
gsub("^\\s", "", gsub("([A-Z(])", " \\1", a))
Try this:
gsub("(?<=.)([A-Z(])", " \\1", a, perl = TRUE)
giving:
[1] "Abc Def (123)"
If the string with spaces has no one-character pieces it can be simplified to this:
gsub("(.)([A-Z(])", "\\1 \\2", a)
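The caveat about one-character pieces can be seen with a short sketch. The second input, "ABC", is made up for illustration.

```r
a <- "AbcDef(123)"

# (.) consumes the character before each capital or "(" ...
simple <- gsub("(.)([A-Z(])", "\\1 \\2", a)
print(simple)
# [1] "Abc Def (123)"

# ... so with adjacent capitals the consumed character cannot start the
# next match, and only every other boundary gets a space.
adjacent <- gsub("(.)([A-Z(])", "\\1 \\2", "ABC")
print(adjacent)
# [1] "A BC"
```

The lookbehind version avoids this because (?<=.) checks the preceding character without consuming it.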

Resources