Why does gsub/sub not work to replace ".."? - r

When I call rownames on my df I get something like this:
"Saint.Petersburg..Russia" "Istanbul..Turkey"
This is what I coded:
gsub("..", " ", rownames(df))
This is what was returned
[1] " " " " "
What I expected was
"Saint.Petersburg Russia" "Istanbul Turkey"
Does anyone know what is going wrong here?

In the default regex mode, . matches any character unless it is escaped (\\.) or placed inside square brackets ([.]). The faster option is fixed = TRUE, which treats the pattern as a literal string:
gsub("..", " ", rownames(df), fixed = TRUE)
#[1] "Saint.Petersburg Russia" "Istanbul Turkey"
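As a rough side-by-side of the options mentioned above, the sketch below applies each one to a vector that mirrors the row names in the question (the vector x is a stand-in, not the OP's actual data). It also shows why the unescaped pattern returns only spaces: every pair of characters is a match.

```r
# Stand-in for rownames(df) from the question
x <- c("Saint.Petersburg..Russia", "Istanbul..Turkey")

# Unescaped: ".." matches ANY two characters, so each pair becomes one space
unescaped <- gsub("..", " ", x)
nchar(unescaped)  # 12 8 (half of the original 24 and 16 characters)

# Three ways to match two literal dots
escaped   <- gsub("\\.\\.", " ", x)            # each dot escaped
bracketed <- gsub("[.][.]", " ", x)            # each dot in a character class
literal   <- gsub("..", " ", x, fixed = TRUE)  # no regex interpretation at all

print(literal)
# [1] "Saint.Petersburg Russia" "Istanbul Turkey"
```

The single dot in "Saint.Petersburg" is left alone in all three variants because the pattern requires two consecutive dots.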

Related

Remove specific string

I would like to remove this character
c("
I use this
df <- gsub("c/(/"", " ", df$text)
But I receive this error:
Error: unexpected string constant in "inliwc <- gsub("c/(/"", ""
What can I do?
You need to escape the round bracket as well as the double quote, which can be done as follows:
temp <- 'this is ac(" string'
gsub("c\\(\"", " ", temp)
#OR use single quotes in gsub
#gsub('c\\("', " ", temp)
#[1] "this is a string"
A faster way would be to use fixed = TRUE
gsub('c("', " ", temp, fixed = TRUE)
You can also use sub if there is a single occurrence of the pattern in the string.
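To illustrate the difference between sub and gsub mentioned above, here is a small sketch on a made-up string (temp2 is hypothetical and contains the pattern twice):

```r
# Hypothetical input with two occurrences of c("
temp2 <- 'ac(" and bc("'

# sub() replaces only the first occurrence
first_only <- sub('c("', "", temp2, fixed = TRUE)

# gsub() replaces every occurrence
all_matches <- gsub('c("', "", temp2, fixed = TRUE)

print(first_only)   # a and bc("
print(all_matches)  # a and b
```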
The opening round bracket is a regex metacharacter; in R, its literal use needs to be escaped using \\:
text <- "c("
text <- gsub("c\\(", "", text)
We can also use sub
sub('c[()]"', '', temp)
#[1] "this is a string"
data
temp <- 'this is ac(" string'

what is going on with my trimws?

I was fiddling around in text cleaning when I ran into an interesting occurrence.
Reproducible Code:
trimws(list(c("this is an outrante", " hahaha", " ")))
Output:
[1] "c(\"this is an outrante\", \" hahaha\", \" \")"
I've checked the trimws documentation, and it doesn't go into specifics beyond saying that it expects a character vector; in my case, I've supplied a list containing a character vector. I know I can use lapply to solve this easily, but what I want to understand is what trimws is doing with my input as is.
trimws is meant to be applied directly to a character vector, not to a list.
According to ?trimws documentation, the usage is
trimws(x, which = c("both", "left", "right"))
where
x- a character vector
It is not clear why the vector is wrapped in a list
trimws(c("this is an outrante", " hahaha", " "))
If it really needs to be in a list, then use one of the functions that goes into the list elements and apply the trimws
lapply(list(c("this is an outrante", " hahaha", " ")), trimws)
Also, note that the OP's list is a list of length 1, which can be converted back to a vector either by [[1]] or unlist (more general)
trimws(list(c("this is an outrante", " hahaha", " "))[[1]])
As for why the function behaves this way: it expects a vector as its input argument, so the list is coerced to character, which turns the whole inner vector into a single string. The behavior is similar for other functions that expect a vector, e.g.
paste(list(c("this is an outrante", " hahaha", " ")))
as.character(list(c("this is an outrante", " hahaha", " ")))
If we check the source of trimws, it calls the regex function sub, which expects a character vector:
mysub <- function(re, x) sub(re, "", x, perl = TRUE)
mysub("^[ \t\r\n]+", list(c("this is an outrante", " hahaha", " ")))
#[1] "c(\"this is an outrante\", \" hahaha\", \" \")"
Pass it a vector
mysub("^[ \t\r\n]+", c("this is an outrante", " hahaha", " "))
#[1] "this is an outrante" "hahaha" ""
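The coercion step described above can be seen in isolation. The sketch below first shows that as.character collapses the list into one string, then applies trimws per element with lapply. The nested list and the use of rapply are an extra illustration not taken from the original answer.

```r
x <- list(c("this is an outrante", " hahaha", " "))

# Coercing the list to character deparses its single element into one string
coerced <- as.character(x)
length(coerced)  # 1

# Applying trimws to each list element keeps the list structure
cleaned <- lapply(x, trimws)

# rapply() recurses, so it also reaches into nested lists (hypothetical example)
nested <- list(a = " x ", b = list(" y "))
cleaned_nested <- rapply(nested, trimws, how = "replace")

print(cleaned[[1]])
# [1] "this is an outrante" "hahaha"              ""
```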

Need to trim last character string only if is blank or "."

I have a large vector of words read from an excel file. Some of those records end with space or "." period. Only in those cases, I need to trim those chars.
Example:
"depresion" "tristeza."
"nostalgia" "preocupacion."
"enojo." "soledad "
"frustracion" "desesperacion "
"angustia." "desconocidos."
Notice some words end normal without "." or " ".
Is there a way to do that?
I have this
substr(conceptos, 1, nchar(conceptos)-1)
to test for the last character (conceptos is this long vector)
Thanks for any advice,
We can use sub to match zero or more . or space characters at the end of the string and replace them with an empty string ("")
sub("(\\.| )*$", "", v1)
#[1] "depresion" "tristeza" "nostalgia" "preocupacion" "enojo"
#[6] "soledad" "frustracion" "desesperacion"
#[9] "angustia" "desconocidos"
data
v1 <- c("depresion","tristeza.","nostalgia","preocupacion.",
"enojo.","soledad ","frustracion","desesperacion ",
"angustia.","desconocidos.")
Regular expressions are good for this:
library(stringr)
x = c("depresion", "tristeza.", "nostalgia", "preocupacion.",
"enojo.", "soledad ", "frustracion", "desesperacion ",
"angustia.", "desconocidos.")
x_replaced = str_replace(x, "(\\.|\\s)$", "")
The pattern (\\.|\\s)$ will match a single . or whitespace character that occurs right at the end of the string.
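The two patterns shown in these answers differ when a word ends in more than one unwanted character. A minimal base R sketch, using a made-up vector v that adds such cases to a few of the original words:

```r
# Hypothetical input: "enojo. " and "fin.." end in two unwanted characters
v <- c("tristeza.", "soledad ", "enojo. ", "fin..", "nostalgia")

# Without a quantifier, only the very last character is removed
one_char <- sub("(\\.|\\s)$", "", v)

# With a quantifier, the whole trailing run of dots and spaces is removed
whole_run <- sub("[. ]+$", "", v)

print(one_char)
# [1] "tristeza"  "soledad"   "enojo."    "fin."      "nostalgia"
print(whole_run)
# [1] "tristeza"  "soledad"   "enojo"     "fin"       "nostalgia"
```

Which one is appropriate depends on whether the data can contain runs such as ". " at the end of a word.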
Try this:
x <- trimws(conceptos)  # remove surrounding whitespace first
ifelse(endsWith(x, "."), substr(x, 1, nchar(x) - 1), x)  # drop a final "." if present

regex misunderstanding in r

I don't seem to understand gsub or stringr.
Example:
> a<- "a book"
> gsub(" ", ".", a)
[1] "a.book"
Okay. BUT:
> a<-"a.book"
> gsub(".", " ", a)
[1] " "
I would have expected
"a book"
I'm replacing the full stop with a space.
Also, with stringr, str_replace(a, ".", " ") returns:
" .book"
and str_replace_all(a, ".", " ") returns
" "
I can use stringi: stri_replace(a, " ", fixed="."):
"a book"
I'm just wondering why gsub (and str_replace) don't act as I'd have expected. They work when replacing a space with another character, but not the other way around.
That's because the first argument to gsub, pattern, is a regular expression. In a regex, the period . is a metacharacter that matches any single character (see ?base::regex). In your case you need to escape the period:
gsub("\\.", " ", a)
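One way to see what the unescaped period is doing is to list the matches directly. This is a small base R sketch, so it does not need stringr or stringi:

```r
a <- "a.book"

# Unescaped "." matches every single character in the string
any_char <- regmatches(a, gregexpr(".", a))[[1]]

# Escaped "\\." matches only the literal period
literal_dot <- regmatches(a, gregexpr("\\.", a))[[1]]

print(any_char)     # "a" "." "b" "o" "o" "k"
print(literal_dot)  # "."

# sub() replaces only the first match, which for "." is the first character
first_replaced <- sub(".", " ", a)
print(first_replaced)  # " .book"
```

The last result is the same as the str_replace output in the question, since both replace only the first match.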

substitute word separators with space

I just want to replace some word separators with a space. Any hints on this? Doesn't work after converting to character either.
df <- data.frame(m = 1:3, n = c("one.one", "one.two", "one.three"))
> gsub(".", "\\1 \\2", df$n)
[1] " " " " " "
> gsub(".", " ", df$n)
[1] " " " " " "
You don't need to use regex for one-to-one character translation. You can use chartr().
df$n <- chartr(".", " ", df$n)
df
# m n
# 1 1 one one
# 2 2 one two
# 3 3 one three
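chartr() can also translate several characters in one call, since each character in the first argument maps to the character at the same position in the second. A short sketch with a made-up vector that uses two different separators:

```r
# Hypothetical input with "." and "_" as separators
n <- c("one.one", "one_two")

# "." maps to the first space, "_" maps to the second space
translated <- chartr("._", "  ", n)

print(translated)
# [1] "one one" "one two"
```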
You can try
gsub("[.]", " ", df$n)
#[1] "one one" "one two" "one three"
Set fixed = TRUE if you are looking for an exact match and don't need a regular expression.
gsub(".", " ", df$n, fixed = TRUE)
#[1] "one one" "one two" "one three"
That's also faster than using an appropriate regex for such a case.
I suggest doing it like this:
gsub("\\.", " ", df$n)
OR
gsub("\\W", " ", df$n)
\\W matches any non-word character. \\W+ matches one or more non-word characters. Use \\W+ if necessary.
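The difference between these patterns only shows up when the input contains other separators or repeated ones. A quick sketch on a made-up vector (not the OP's data):

```r
# Hypothetical input: a dot, a hyphen, and a double dot as separators
n <- c("one.one", "one-two", "one..three")

dots_only <- gsub("\\.", " ", n)   # only literal dots are replaced
non_word  <- gsub("\\W", " ", n)   # every non-word character is replaced
collapsed <- gsub("\\W+", " ", n)  # each run of non-word characters becomes one space

print(dots_only)  # "one one"    "one-two"    "one  three"
print(non_word)   # "one one"    "one two"    "one  three"
print(collapsed)  # "one one"    "one two"    "one three"
```

Note that \\W is broader than \\.: it also replaces hyphens, spaces, and other punctuation, which may or may not be what is wanted.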

Resources