I have a string in R where the words are separated by a random number of \n characters:
mystring = c("hello\n\ni\n\n\n\n\nam\na\n\n\n\n\n\n\ndog")
I want to replace any run of repeated \n characters so that there is only a single space between words. I can currently do this as follows, but I would like a tidier solution:
mystring %>%
gsub("\n\n", "\n", .) %>%
gsub("\n\n", "\n", .) %>%
gsub("\n\n", "\n", .) %>%
gsub("\n", " ", .)
[1] "hello i am a dog"
What is the best way to achieve this?
We can use + to signify one or more repetitions
gsub("\n+", " ", mystring)
[1] "hello i am a dog"
We could use the same logic as akrun with str_replace_all:
library(stringr)
str_replace_all(mystring, '\n+', ' ')
[1] "hello i am a dog"
In this case, you might find str_squish() convenient. This is intended to solve this exact problem, while the other solutions show good ways to solve the more general case.
library(stringr)
mystring = c("hello\n\ni\n\n\n\n\nam\na\n\n\n\n\n\n\ndog")
str_squish(mystring)
# [1] "hello i am a dog"
If you look at the code of str_squish(), it is basically a wrapper around str_replace_all():
str_squish
function (string)
{
stri_trim_both(str_replace_all(string, "\\s+", " "))
}
Another possible solution, based on stringr::str_squish:
library(stringr)
str_squish(mystring)
#> [1] "hello i am a dog"
I have been wondering how to extract a substring in R, using stringr or another package, between the exact phrase "to the" (which is always lowercase) and the second comma in a sentence.
For instance:
String: "This not what I want to the THIS IS WHAT I WANT, DO YOU SEE IT?, this is not what I want"
Desired output: "THIS IS WHAT I WANT, DO YOU SEE IT?"
I have this vector:
x<-c("This not what I want to the THIS IS WHAT I WANT, DO YOU SEE IT?, this is not what I want",
"HYU_IO TO TO to the I WANT, THIS, this i dont, want", "uiui uiu to the xxxx,,this is not, what I want")
and I am trying to use this code
str_extract(string = x, pattern = "(?<=to the ).*(?=\\,)")
but I can't seem to get it to work properly to give me this:
"THIS IS WHAT I WANT, DO YOU SEE IT?"
"I WANT, THIS"
"xxxx,"
Thank you guys so much for your time and help
You were close!
str_extract(string = x, pattern = "(?<=to the )[^,]*,[^,]*")
# [1] "THIS IS WHAT I WANT, DO YOU SEE IT?"
# [2] "I WANT, THIS"
# [3] "xxxx,"
The look-behind stays the same, [^,]* matches anything but a comma, then , matches exactly one comma, then [^,]* again for anything but a comma.
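For reference, a base R sketch of the same idea (not from the original answer): regexpr() needs perl = TRUE to support the look-behind, and regmatches() extracts the first match from each element.
# same pattern as above; perl = TRUE enables the look-behind
m <- regexpr("(?<=to the )[^,]*,[^,]*", x, perl = TRUE)
regmatches(x, m)
# returns the same three matches as the stringr version above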
An alternative approach, by no means comparable with Gregor Thomas' answer, but an alternative nonetheless:
convert the vector to a tibble,
separate twice: first by "to the ", then by ",",
paste the pieces back together,
pull for vector output.
library(tidyverse)
as_tibble(x) %>%
separate(value, c("a", "b"), sep = 'to the ') %>%
separate(b, c("a", "c"), sep =",") %>%
mutate(x = paste0(a, ",", c), .keep="unused") %>%
pull(x)
[1] "THIS IS WHAT I WANT, DO YOU SEE IT?"
[2] "I WANT, THIS"
[3] "xxxx,"
I am working on some raw text and want to replace all multiple spaces with one space. Normally, I would use stringr's str_squish, but unfortunately it also removes line breaks (\n and \r), which I have to keep.
Any ideas? Below are my attempts. Many thanks!
library(tidyverse)
x <- "hello \n\r how are you \n\r all good?"
str_squish(x)
#> [1] "hello how are you all good?"
str_replace_all(x, "[:space:]+", " ")
#> [1] "hello how are you all good?"
str_replace_all(x, "\\s+", " ")
#> [1] "hello how are you all good?"
Created on 2020-07-01 by the reprex package (v0.3.0)
With stringr, you may use the \h shorthand character class to match any horizontal whitespace.
library(stringr)
x <- "hello \n\r how are you \n\r all good?"
x <- str_replace_all(x, "\\h+", " ")
## [1] "hello \n\r how are you \n\r all good?"
In base R, you may use it, too, with a PCRE pattern:
gsub("\\h+", " ", x, perl=TRUE)
If you still want to match any whitespace (including some Unicode line breaks) other than the CR and LF symbols, you may simply use the [^\S\r\n] pattern:
str_replace_all(x, "[^\\S\r\n]+", " ")
gsub("[^\\S\r\n]+", " ", x, perl=TRUE)
You could just use a literal space in the regex instead of \\s or [:space:]:
str_replace_all(x, " +", " ") %>%
cat()
hello
how are you
all good?
You can also include tabs by using [ \t], [:blank:], or \\h instead of a literal space. In this case, you may want to use {2,} to match 2 or more occurrences so you don't have to write the pattern twice (i.e. [:blank:][:blank:]+):
y <- "hello \n\r\t\thow are you \n\r all good?"
str_replace_all(y, "[:blank:]{2,}", " ") %>%
cat()
hello
how are you
all good?
Given the following string:
my.str <- "I welcome you my precious dude"
One splits it:
my.splt.str <- strsplit(my.str, " ")
And then concatenates:
paste(my.splt.str[[1]][1:2], my.splt.str[[1]][3:4], my.splt.str[[1]][5:6], sep = " ")
The result is:
[1] "I you precious" "welcome my dude"
When not using the colon operator it returns the correct order:
paste(my.splt.str[[1]][1], my.splt.str[[1]][2], my.splt.str[[1]][3], my.splt.str[[1]][4], my.splt.str[[1]][5], my.splt.str[[1]][6], sep = " ")
[1] "I welcome you my precious dude"
Why is this happening?
paste is designed to work with vectors element-by-element. Say you did this:
names <- c('Alice', 'Bob', 'Charlie')
paste('Hello', names)
You'd want the result to be [1] "Hello Alice" "Hello Bob" "Hello Charlie", rather than "Hello Hello Hello Alice Bob Charlie".
To make it work like you want it to, rather than giving the different sections to paste as separate arguments, you could first combine them into a single vector with c:
paste(c(my.splt.str[[1]][1:2], my.splt.str[[1]][3:4], my.splt.str[[1]][5:6]), collapse = " ")
## [1] "I welcome you my precious dude"
We can use collapse instead of sep
paste(my.splt.str[[1]], collapse= ' ')
If we use the OP's first approach, it pastes the corresponding elements from each of the subsets.
If we want to paste selectively, first create an object so that repeating [[ can be avoided:
v1 <- my.splt.str[[1]]
v1[3:4] <- toupper(v1[3:4])
paste(v1, collapse=" ")
#[1] "I welcome YOU MY precious dude"
When we have multiple arguments in paste, it pastes the corresponding elements of each argument:
paste(v1[1:2], v1[3:4])
#[1] "I you" "welcome my"
If we use collapse, the result is a single string, but the order is still different because the first element of v1[1:2] is pasted with the first element of v1[3:4], and the second with the second:
paste(v1[1:2], v1[3:4], collapse = ' ')
#[1] "I you welcome my"
This is documented in ?paste:
paste converts its arguments (via as.character) to character strings, and concatenates them (separating them by the string given by sep). If the arguments are vectors, they are concatenated term-by-term to give a character vector result. Vector arguments are recycled as needed, with zero-length arguments being recycled to "".
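A small illustration of that recycling behaviour (a made-up example, not from the answer): the shorter vector is repeated to match the longer one.
paste(c("a", "b", "c", "d"), c("x", "y"), sep = "-")
# [1] "a-x" "b-y" "c-x" "d-y"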
Converting to uppercase can also be done on a substring without splitting:
sub("^(\\w+\\s+\\w+)\\s+(\\w+\\s+\\w+)", "\\1 \\U\\2", my.str, perl = TRUE)
#[1] "I welcome YOU MY precious dude"
I would like to remove multiple words from a string in R, but would like to use a character vector instead of a regexp.
For example, if I had the string
"hello how are you"
and wanted to remove
c("hello", "how")
I would want to return
" are you"
I can get close with str_remove() from stringr
"hello how are you" %>% str_remove(c("hello","how"))
[1] "how are you" "hello are you"
But I'd need to do something to get this down into a single string. Is there a function that does all of this on one call?
We can use | as a regex OR (alternation):
library(stringr)
library(magrittr)
pat <- str_c(words, collapse="|")
"hello how are you" %>%
str_remove_all(pat) %>%
trimws
#[1] "are you"
data
words <- c("hello", "how")
A base R possibility could be:
x <- "hello how are you"
trimws(gsub("hello|how", "\\1", x))
[1] "are you"
Or, if you have more words, a clever idea proposed by @Wimpel:
words <- paste(c("hello", "how"), collapse = "|")
trimws(gsub(words, "\\1", x))
I have the following string: " John Andrew Thomas" (4 empty spaces before John), and I need to split and concatenate it so that my output is "John#gmail.com;Andrew#gmail.com;Thomas#gmail.com". I also need to remove all whitespace.
My best guess is:
test = unlist(lapply(names, strsplit, split = " ", fixed = FALSE))
paste(test, collapse = "#gmail.com")
but I get this as an output:
"#gmail.com#gmail.com#gmail.com#gmail.comJohn#gmail.comAndrew#gmail.comThomas"
names <- " John Andrew Thomas"
test <- unlist(lapply(names, strsplit, split = " ", fixed = FALSE))
paste(test[test != ""],"#gmail.com",sep = "",collapse = ";")
A small tweak to your paste line will remove the extra spaces and separate the email addresses with a semicolon.
Output is the following:
[1] "John#gmail.com;Andrew#gmail.com;Thomas#gmail.com"
With stringr, we can use its str_trim function to deal with your leading whitespace. Assuming your string is x:
library(stringr)
paste(sapply(str_split(str_trim(x), " "), function(i) sprintf("%s#gmail.com", i)), collapse = ";")
And here's a piped version, so it's easier to follow:
library(dplyr)
library(stringr)
x %>%
# get rid of leading and trailing whitespace
str_trim() %>%
# make a list with the elements of the string, split at " "
str_split(" ") %>%
# get an array of strings where those list elements are added to a fixed chunk via sprintf
sapply(., function(i) sprintf("%s#gmail.com", i)) %>%
# concatenate the resulting array into a single string with semicolons
paste(., collapse = ";")
Another approach using the trimws function from base R:
paste0(unlist(strsplit(trimws(names)," ")),"#gmail.com",collapse = ";")
#[1] "John#gmail.com;Andrew#gmail.com;Thomas#gmail.com"
Data
names <- " John Andrew Thomas"
Another idea using stringi:
v <- " John Andrew Thomas"
paste0(stringi::stri_extract_all_words(v, simplify = TRUE), "#gmail.com", collapse = ";")
Which gives:
#[1] "John#gmail.com;Andrew#gmail.com;Thomas#gmail.com"
You can use gsub(), and a little creativity.
x <- " John Andrew Thomas"
paste0(gsub(" ", "#gmail.com;", trimws(x)), "#gmail.com")
# [1] "John#gmail.com;Andrew#gmail.com;Thomas#gmail.com"
No packages, no loops, and no string splitting.