what is going on with my trimws? - r

I was fiddling around in text cleaning when I ran into an interesting occurrence.
Reproducible Code:
trimws(list(c("this is an outrante", " hahaha", " ")))
Output:
[1] "c(\"this is an outrante\", \" hahaha\", \" \")"
I've checked out the trimws documentation and it doesn't go into any specifics besides the fact that it expects a character vector, and in my case, I've supplied with a list of a list of character vectors. I know I can use lapply to easily solve this, but what I want to understand is what is going on with my trimws as is?

The trimws would be directly applied to vector and not on a list.
According to ?trimws documentation, the usage is
trimws(x, which = c("both", "left", "right"))
where
x- a character vector
It is not clear why the vector is wrapped in a list
trimws(c("this is an outrante", " hahaha", " "))
If it really needs to be in a list, then use one of the functions that goes into the list elements and apply the trimws
lapply(list(c("this is an outrante", " hahaha", " ")), trimws)
Also, note that the OP's list is a list of length 1, which can be converted back to a vector either by [[1]] or unlist (more general)
trimws(list(c("this is an outrante", " hahaha", " "))[[1]])
Regarding why a function behaves this, it is supposed to have an input argument as a vector. The behavior is similar for other functions that expect a vector, for e.g.
paste(list(c("this is an outrante", " hahaha", " ")))
as.character(list(c("this is an outrante", " hahaha", " ")))
If we check the trimws function, it is calling regex sub which requires a vector
mysub <- function(re, x) sub(re, "", x, perl = TRUE)
mysub("^[ \t\r\n]+", list(c("this is an outrante", " hahaha", " ")))
#[1] "c(\"this is an outrante\", \" hahaha\", \" \")"
Pass it a vector
mysub("^[ \t\r\n]+", c("this is an outrante", " hahaha", " "))
#[1] "this is an outrante" "hahaha" ""

Related

concat a SPLIT variable in R

I've been trying to split a string in R and then joining it back together but none of the tricks have worked for what I need.
!!!Important !!! My question is not a duplicate:
saving a split result into a variable and then pasting, collapsing etc is not the same as just paste a vector like this
paste(c("bla", "bla"), collapse = " ")
> paste(c("The","birch", "canoe"), collapse = ' ')
[1] "The birch canoe"
> paste(s, collapse=" ")
[1] "c(\"The\", \"birch\", \"canoe\", \"slid\", \"on\", \"the\", \"smooth\", \"planks.\")"
Here's the code:
I take pre-saved sentences in R
sentences[1]
and split it
s <- str_split(sentences[1])
this is what I get:
[1] "The" "birch" "canoe" "slid" "on" "the" "smooth" "planks."
Now when I try to join this back together I get backslashes
toString(s)
"c(\"The\", \"birch\", \"canoe\", \"slid\", \"on\", \"the\", \"smooth\", \"planks.\")"
paste produces the same result:
> paste(s)
[1] "c(\"The\", \"birch\", \"canoe\", \"slid\", \"on\", \"the\", \"smooth\", \"planks.\")"
I tried using str_split_fixed and wrap it into a vector, but it joins the sentence back together with a comma, even if I ask it not to.
v <- as.vector(str_split_fixed(sentences[1], " ", 5))
toString(v, sep="")
[1] "The, birch, canoe, slid, on the smooth planks."
I thought maybe str_split_i or str_split_1 could solve it as according to the documentation in theory it should, but that's what I get when I try to use it
"could not find function "str_split_1" "
Are there any other ways to join back a string after splitting it without it producing commas or backslashes?..
See the difference between:
s <- list(c("The" , "birch" , "canoe" , "slid" , "on" , "the" , "smooth" , "planks."))
paste(s[1], collapse = " ")
#[1] "c(\"The\", \"birch\", \"canoe\", \"slid\", \"on\", \"the\", \"smooth\", \"planks.\")"
and
paste(s[[1]], collapse = " ")
#[1] "The birch canoe slid on the smooth planks."
This is because [[ will extract the vector, and [ and will keep the output as a list.

Why does paste() concatenate list elements in the wrong order?

Given the following string:
my.str <- "I welcome you my precious dude"
One splits it:
my.splt.str <- strsplit(my.str, " ")
And then concatenates:
paste(my.splt.str[[1]][1:2], my.splt.str[[1]][3:4], my.splt.str[[1]][5:6], sep = " ")
The result is:
[1] "I you precious" "welcome my dude"
When not using the colon operator it returns the correct order:
paste(my.splt.str[[1]][1], my.splt.str[[1]][2], my.splt.str[[1]][3], my.splt.str[[1]][4], my.splt.str[[1]][5], my.splt.str[[1]][6], sep = " ")
[1] "I welcome you my precious dude"
Why is this happening?
paste is designed to work with vectors element-by-element. Say you did this:
names <- c('Alice', 'Bob', 'Charlie')
paste('Hello', names)
You'd want to result to be [1] "Hello Alice" "Hello Bob" "Hello Charlie", rather than "Hello Hello Hello Alice Bob Charlie".
To make it work like you want it to, rather than giving the different sections to paste as separate arguments, you could first combine them into a single vector with c:
paste(c(my.splt.str[[1]][1:2], my.splt.str[[1]][3:4], my.splt.str[[1]][5:6]), collapse = " ")
## [1] "I welcome you my precious dude"
We can use collapse instead of sep
paste(my.splt.str[[1]], collapse= ' ')
If we use the first approach by OP, it is pasteing the corresponding elements from each of the subset
If we want to selectively paste, first create an object because the [[ repeat can be avoided
v1 <- my.splt.str[[1]]
v1[3:4] <- toupper(v1[3:4])
paste(v1, collapse=" ")
#[1] "I welcome YOU MY precious dude"
When we have multiple arguments in paste, it is doing the paste on the corresponding elements of it
paste(v1[1:2], v1[3:4])
#[1] "I you" "welcome my"
If we use collapse, then it would be a single string, but still the order is different because the first element of v1[1:2] is pasteed with the first element of v1[3:4] and 2nd with the 2nd element
paste(v1[1:2], v1[3:4], collapse = ' ')
#[1] "I you welcome my"
It is documented in ?paste
paste converts its arguments (via as.character) to character strings, and concatenates them (separating them by the string given by sep). If the arguments are vectors, they are concatenated term-by-term to give a character vector result. Vector arguments are recycled as needed, with zero-length arguments being recycled to "".
Also, converting to uppercase can be done on a substring without splitting as well
sub("^(\\w+\\s+\\w+)\\s+(\\w+\\s+\\w+)", "\\1 \\U\\2", my.str, perl = TRUE)
#[1] "I welcome YOU MY precious dude"

strsplit does not split for all elements of character vector provided to parameter "split"

The R documentation for the strsplit function states for parameter split that "If split has length greater than 1, it is re-cycled along x."
I take it to mean that if I use the following code
strsplit(x = "Whatever will be will be", split = c("ever", "be"))
..., I will get x split into "What" and "will" and "will be". This does not happen. The output is "What" and "will be will be".
Am I misinterpreting the documentation? Also, how can I get the result I desire?
The arguments in split will be recycled if also x has multiple arguments:
strsplit(x = c("Whatever will be will be","Whatever will be will be"),
split = c("ever", "be"))
[[1]]
[1] "What" " will be will be"
[[2]]
[1] "Whatever will " " will "
The behaviour I suspect you expect is achieved with a |:
strsplit(x = "Whatever will be will be", split = c("ever|be"))
[[1]]
[1] "What" " will " " will "
The split is recycled across elements of x, so that the first element of split is applied to the first element of x, the second to the second, etc. So, for example:
strsplit(x = c("Whatever will be will be", "Whatever will be will be"), split = c("ever", "be"))
[[1]]
[1] "What" " will be will be"
[[2]]
[1] "Whatever will " " will "

regex misunderstanding in r

I don't seem to understand gsub or stringr.
Example:
> a<- "a book"
> gsub(" ", ".", a)
[1] "a.book"
Okay. BUT:
> a<-"a.book"
> gsub(".", " ", a)
[1] " "
I would of expected
"a book"
I'm replacing the full stop with a space.
Also: srintr: str_replace(a, ".", " ") returns:
" .book"
and str_replace_all(a, ".", " ") returns
" "
I can use stringi: stri_replace(a, " ", fixed="."):
"a book"
I'm just wondering why gsub (and str_replace) don't act as I'd have expected. They work when replacing a space with another character, but not the other way around.
That's because the first argument to gsub, namely pattern is actually a regex. In regex the period . is a metacharacter and it matches any single character, see ?base::regex. In your case you need to escape the period in the following way:
gsub("\\.", " ", a)

Replacing strings in R

I am trying to replace strings in R in a large number of texts.
Essentially, this reproduces the format of the data from which I try to delete the '\n' parts.
document <- as.list(c("This is \\na try-out", "And it \\nfails"))
I can do this with a loop and gsub but it takes forever. I looked at this post for a solution. So I tried: temp <- apply(document, 2, function(x) gsub("\\n", " ", fixed=TRUE)). I also used lapply, but it also gives an error message. I can't figure this out, help!
use lapply if you want to return a list
document <- as.list(c("This is \\na try-out", "And it \\nfails"))
temp <- lapply(document, function(x) gsub("\\n", " ", x, fixed=TRUE))
##[[1]]
##[1] "This is a try-out"
##[[2]]
##[1] "And it fails"

Resources