Replace multiple strings in one gsub() or chartr() statement in R?

Replace multiple strings in one gsub() or chartr() statement in R? - r

I have a string variable containing alphabet[a-z], space[ ], and apostrophe['],eg. x <- "a'b c"
I want to replace apostrophe['] with blank[], and replace space[ ] with underscore[_].
x <- gsub("'", "", x)
x <- gsub(" ", "_", x)
It works absolutely, but when I have a lot of condition, the code becomes ugly. Therefore, I want to use chartr(), but chartr() can't deal with blank, eg.
x <- chartr("' ", "_", x)
#Error in chartr("' ", "_", "a'b c") : 'old' is longer than 'new'
Is there any way to solve this problem? thanks!

You can use gsubfn
library(gsubfn)
gsubfn(".", list("'" = "", " " = "_"), x)
# [1] "ab_c"
Similarly, we can also use mgsub which allows multiple replacement with multiple pattern to search
mgsub::mgsub(x, c("'", " "), c("", "_"))
#[1] "ab_c"

I am a fan of the syntax that the %<>% and %>% opperators from the magrittr package provide.
library(magrittr)
x <- "a'b c"
x %<>%
gsub("'", "", .) %>%
gsub(" ", "_", .)
x
##[1] "ab_c"
gusbfn is wonderful, but I like the chaining %>% allows.

I'd go with the quite fast function stri_replace_all_fixed from library(stringi):
library(stringi)
stri_replace_all_fixed("a'b c", pattern = c("'", " "), replacement = c("", "_"), vectorize_all = FALSE)
Here is a benchmark taking into account most of the other suggested solutions:
library(stringi)
library(microbenchmark)
library(gsubfn)
library(mgsub)
library(magrittr)
library(dplyr)
x_gsubfn <-
x_mgsub <-
x_nested_gsub <-
x_magrittr <-
x_stringi <- "a'b c"
microbenchmark("gsubfn" = { gsubfn(".", list("'" = "", " " = "_"), x_gsubfn) },
"mgsub" = { mgsub::mgsub(x_mgsub, c("'", " "), c("", "_")) },
"nested_gsub" = { gsub("Find", "Replace", gsub("Find","Replace", x_nested_gsub)) },
"magrittr" = { x_magrittr %<>% gsub("'", "", .) %>% gsub(" ", "_", .) },
"stringi" = { stri_replace_all_fixed(x_stringi, pattern = c("'", " "), replacement = c("", "_"), vectorize_all = FALSE) }
)
Unit: microseconds
expr min lq mean median uq max neval
gsubfn 458.217 482.3130 519.12820 513.3215 538.0100 715.371 100
mgsub 180.521 200.8650 221.20423 216.0730 231.6755 460.587 100
nested_gsub 14.615 15.9980 17.92178 17.7760 18.7630 40.687 100
magrittr 113.765 133.7125 148.48202 142.9950 153.0680 296.261 100
stringi 3.950 7.7030 8.41780 8.2960 9.0860 26.071 100

I know it is a bit old but it is hard to pass on an efficient base R solution. Just use the pipe:
test <- "abcegdfk461mnb"
test2 <- gsub("e|4|6","",test)
print(test2)

I think nested gsub will do the job.
gsub("Find","Replace",gsub("Find","Replace",X))

I would opt for a magrittr and/or dplyr solution, as well. However, I prefer not making a new copy of the object, especially if it is in a function and can be returned cheaply.
i.e.
return(
catInTheHat %>% gsub('Thing1', 'Thing2', .) %>% gsub('Red Fish', 'Blue
Fish', .)
)
...and so on.

gsub("\\s", "", chartr("' ", " _", x)) # Use whitespace and then remove it

Try this replace multi text character in column:
df$TYPE <- str_replace_all(df$TYPE, c("test" = "new_test", "G" = "N", "T" = "W"))

I use this function, which also allows omitting the argument for the replacement if the replacement is empty:
s=function(x,...,ignore.case=F,perl=F,fixed=F,useBytes=F){
a=match.call(expand.dots=F)$...
l=length(a)
for(i in seq(1,l,2))x=gsub(a[[i]],if(i==l)""else a[[i+1]],x,ignore.case=ignore.case,perl=perl,fixed=fixed,useBytes=useBytes)
x
}
> s("aa bb cc","aa","dd","bb")
[1] "dd cc"

Related

How can I get the output of this function to print onto different lines in R?

So, I am writing a function that, among many other things, is supposed to keep only the first sentence from each paragraph of a text and preserve the paragraph structure (i.e. each sentence is in its own line). Here is the code that I have so far:
text_shortener <- function(input_text) {
lapply(input_text, function(x)str_split(x, "\\.", simplify = T)[1])
first.sentences <- unlist(lapply(input_text, function(x)str_split(x, "\\.", simplify = T)[1]))
no.spaces <- gsub(pattern = "(?<=[\\s])\\s*|^\\s+|\\s+$", replacement = "", x = first.sentences, perl = TRUE)
stopwords <- c("the", "really", "truly", "very", "The", "Really", "Truly", "Very")
x <- unlist(strsplit(no.spaces, " "))
no.stopwords <- paste(x[!x %in% stopwords], collapse = " ")
final.text <- gsub(pattern = "(?<=\\w{5})\\w+", replacement = ".", x = no.stopwords, perl=TRUE)
return(final.text)
}
All of the functions are working as they should, but the one part I can't figure out is how to get the output to print onto separate lines. When I run the function with a vector of text (I was using some text from Moby Dick as a test), this is what I get:
> text_shortener(Moby_Dick)
[1] "Call me Ishma. It is a way I have of drivi. off splee., and regul. circu. This is my subst. for pisto. and ball"
What I want is for the output of this function to look like this:
[1] "Call me Ishma."
[2] "It is a way I have of drivi. off splee., and regul. circu."
[3] "This is my subst. for pisto. and ball"
I am relatively new to R and this giving me a real headache, so any help would be much appreciated! Thank you!

Looking at your output, it seems like splitting on a period followed by a capital letter if what you need.
You could accomplish that with strsplit() and split the string up like so:
strsplit("Call me Ishma. It is drivi. off splee., and regul. circu. This is my subst. for pisto.","\\. (?=[A-Z])", perl=T)
That finds instances where a period is followed by a space and a capital letter and splits the character up there.
Edit: You could add it to the end of your function like so:
text_shortener <- function(input_text) {
lapply(input_text, function(x)str_split(x, "\\.", simplify = T)[1])
first.sentences <- unlist(lapply(input_text, function(x)str_split(x, "\\.", simplify = T)[1]))
no.spaces <- gsub(pattern = "(?<=[\\s])\\s*|^\\s+|\\s+$", replacement = "", x = first.sentences, perl = TRUE)
stopwords <- c("the", "really", "truly", "very", "The", "Really", "Truly", "Very")
x <- unlist(strsplit(no.spaces, " "))
no.stopwords <- paste(x[!x %in% stopwords], collapse = " ")
trim.text <- gsub(pattern = "(?<=\\w{5})\\w+", replacement = ".", x = no.stopwords, perl=TRUE)
final.text <- strsplit(trim.text, "\\. (?=[A-Z])", perl=T)
return(final.text)
}

Solve shorten notation by regular expression

I want to solve two shorten notation in R.
For Ade/i, I should get Ade, Adi
For Do(i)lfal, I should get Dolfal, Doilfal
I have this solution
b='Do(i)lferl'
gsub(pattern = '(\\w+)\\((\\w+)+\\)', replacement='\\1\\i,\\1\\2', x=b)
Can anyone help me to code this

If these values are part of a dataframe, you can do this:
df <- data.frame(
Nickname = c("Ade/i", "Do(i)lfal")
)
df$Nickname_new[1] <- paste0(sub("(?=.*/)(.*)/.*", "\\1", df$Nickname[1], perl = T), ",", paste0(unlist(str_split(df$Nickname[1], "\\w/")), collapse = ""))
df$Nickname_new[2] <- paste0(sub("(.*)(\\(.*\\))(.*)", "\\1\\3", df$Nickname[2]),",", sub("(.*)\\((\\w)\\)(.*)", "\\1\\2\\3\\4", df$Nickname[2]))
which gives you:
df
Nickname Nickname_new
1 Ade/i Ade,Adi
2 Do(i)lfal Dolfal,Doilfal
EDIT:
Just in case the whole thing is not part of a dataframe but an atomic vector, you can do this:
x <- c("Ade/i", "Do(i)lfal")
c(paste0(sub("/.*", "", x[grepl("/", x)]), ", ", sub("./", "", x[grepl("/", x)])),
paste0(sub("(.*)\\((\\w)\\)(.*)", "\\1\\2\\3\\4", x[grepl("\\(", x)]), ", ", sub("\\(\\w\\)", "", x[grepl("\\(", x)])))
which gives you:
[1] "Ade, Adi" "Doilfal, Dolfal"
If there are values that you don't want to change, then this regex by #Wiktor will work (it works even with any scenario!):
x <- c("Ade/i", "Do(i)lfal", "Peter", "Mary")
gsub('(\\w*)\\((\\w+)\\)(\\w*)', '\\1\\2\\3, \\1\\3', gsub("(\\w*)(\\w)/(\\w)\\b", "\\1\\2, \\1\\3", x))
which gives you:
[1] "Ade, Adi" "Doilfal, Dolfal" "Peter" "Mary"

Iterating over words across vector of strings and applying change to single word

Given string:
words <- c("fauuucet water", "tap water")
I would like to apply toupper function to all words that contain u.
Desired results
res <- c("FAUUCET water", "tap water")
Function
change_u_case <- function(str) {
sapply(
X = str,
FUN = function(search_term) {
sapply(
X = strsplit(search_term, split = "\\s", perl = TRUE),
FUN = function(word) {
if (grepl(pattern = "u", x = word)) {
toupper(word)
}
}
,
USE.NAMES = FALSE
)
},
USE.NAMES = FALSE
)
}
Tests
change_u_case(words) -> tst_res
words
tst_res
unlist(tst_res)
Notes
In particular, I'm interested whether solution using single rapply call could be build
rlist::list.iter approach would be also interesting
The selection of words containing u character is an example, in practice I would be looking to apply various conditions reflecting length and so on

You can use a single sapply call, i.e.
sapply(strsplit(words, ' '), function(i) {i1 <- grepl('u', i);
i[i1] <- toupper(i[i1]);
paste0(i, collapse = ' ')
})
#[1] "FAUUUCET water" "tap water"

Here is a stringi-based solution:
library(stringi);
sapply(stri_extract_all_words(words),
function(w) paste(ifelse(stri_detect(w, regex = "u"), toupper(w), w), collapse = " "))
#[1] "FAUUUCET water" "tap water"

Try stringr:
str_replace_all(words, '\\w*u\\w*', toupper)
# [1] "FAUUUCET water" "tap water"
More examples:
str_replace_all(c('Upset', 'day day up'), '\\w*u\\w*', toupper)
# [1] "Upset" "day day UP"

String Dataframe R [duplicate]

I have data that looks like this:
vector = c("hello I like to code hello","Coding is fun", "fun fun fun")
I want to remove duplicate words (space delimited) i.e. the output should look like
vector_cleaned
[1] "hello I like to code"
[2] "coding is fun"
[3] "fun"

Split it up (strsplit on spaces), use unique (in lapply), and paste it back together:
vapply(lapply(strsplit(vector, " "), unique), paste, character(1L), collapse = " ")
# [1] "hello i like to code" "coding is fun" "fun"
## OR
vapply(strsplit(vector, " "), function(x) paste(unique(x), collapse = " "), character(1L))
Update based on comments
You can always write a custom function to use with your vapply function. For instance, here's a function that takes a split string, drops strings that are shorter than a certain number of characters, and has the "unique" setting as a user choice.
myFun <- function(x, minLen = 3, onlyUnique = TRUE) {
a <- if (isTRUE(onlyUnique)) unique(x) else x
paste(a[nchar(a) > minLen], collapse = " ")
}
Compare the output of the following to see how it would work.
vapply(strsplit(vector, " "), myFun, character(1L))
vapply(strsplit(vector, " "), myFun, character(1L), onlyUnique = FALSE)
vapply(strsplit(vector, " "), myFun, character(1L), minLen = 0)

I spent a while looking for a data frame, tidyverse-friendly version of this, so figured I'd paste my verbose solution:
library(tidyverse)
df <- data.frame(vector = c("hello I like to code hello",
"Coding is fun",
"fun fun fun"))
df %>%
mutate(split = str_split(vector, " ")) %>% # split
mutate(split = map(.$split, ~ unique(.x))) %>% # drop duplicates
mutate(split = map_chr(.$split, ~paste(.x, collapse = " "))) # recombine
Result:
#> vector split
#> 1 hello I like to code hello hello I like to code
#> 2 Coding is fun Coding is fun
#> 3 fun fun fun fun
Created on 2021-01-03 by the reprex package (v0.3.0)

Using tidyverse
library(dplyr)
library(stringr)
library(tidyr)
df %>%
mutate(rn = row_number()) %>%
separate_longer_delim(vector, delim = regex("\\s+")) %>%
distinct() %>%
reframe(vector = str_c(vector, collapse = " "), .by = c("rn")) %>%
select(-rn)
-output
vector
1 hello I like to code
2 Coding is fun
3 fun

How do keep only unique words within each string in a vector

I have data that looks like this:
vector = c("hello I like to code hello","Coding is fun", "fun fun fun")
I want to remove duplicate words (space delimited) i.e. the output should look like
vector_cleaned
[1] "hello I like to code"
[2] "coding is fun"
[3] "fun"

Split it up (strsplit on spaces), use unique (in lapply), and paste it back together:
vapply(lapply(strsplit(vector, " "), unique), paste, character(1L), collapse = " ")
# [1] "hello i like to code" "coding is fun" "fun"
## OR
vapply(strsplit(vector, " "), function(x) paste(unique(x), collapse = " "), character(1L))
Update based on comments
You can always write a custom function to use with your vapply function. For instance, here's a function that takes a split string, drops strings that are shorter than a certain number of characters, and has the "unique" setting as a user choice.
myFun <- function(x, minLen = 3, onlyUnique = TRUE) {
a <- if (isTRUE(onlyUnique)) unique(x) else x
paste(a[nchar(a) > minLen], collapse = " ")
}
Compare the output of the following to see how it would work.
vapply(strsplit(vector, " "), myFun, character(1L))
vapply(strsplit(vector, " "), myFun, character(1L), onlyUnique = FALSE)
vapply(strsplit(vector, " "), myFun, character(1L), minLen = 0)

I spent a while looking for a data frame, tidyverse-friendly version of this, so figured I'd paste my verbose solution:
library(tidyverse)
df <- data.frame(vector = c("hello I like to code hello",
"Coding is fun",
"fun fun fun"))
df %>%
mutate(split = str_split(vector, " ")) %>% # split
mutate(split = map(.$split, ~ unique(.x))) %>% # drop duplicates
mutate(split = map_chr(.$split, ~paste(.x, collapse = " "))) # recombine
Result:
#> vector split
#> 1 hello I like to code hello hello I like to code
#> 2 Coding is fun Coding is fun
#> 3 fun fun fun fun
Created on 2021-01-03 by the reprex package (v0.3.0)

Using tidyverse
library(dplyr)
library(stringr)
library(tidyr)
df %>%
mutate(rn = row_number()) %>%
separate_longer_delim(vector, delim = regex("\\s+")) %>%
distinct() %>%
reframe(vector = str_c(vector, collapse = " "), .by = c("rn")) %>%
select(-rn)
-output
vector
1 hello I like to code
2 Coding is fun
3 fun

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Replace multiple strings in one gsub() or chartr() statement in R? - r

You can use gsubfn library(gsubfn) gsubfn(".", list("'" = "", " " = "_"), x) # [1] "ab_c" Similarly, we can also use mgsub which allows multiple replacement with multiple pattern to search mgsub::mgsub(x, c("'", " "), c("", "_")) #[1] "ab_c"

I am a fan of the syntax that the %<>% and %>% opperators from the magrittr package provide. library(magrittr) x <- "a'b c" x %<>% gsub("'", "", .) %>% gsub(" ", "_", .) x ##[1] "ab_c" gusbfn is wonderful, but I like the chaining %>% allows.

I know it is a bit old but it is hard to pass on an efficient base R solution. Just use the pipe: test <- "abcegdfk461mnb" test2 <- gsub("e|4|6","",test) print(test2)

I think nested gsub will do the job. gsub("Find","Replace",gsub("Find","Replace",X))

I would opt for a magrittr and/or dplyr solution, as well. However, I prefer not making a new copy of the object, especially if it is in a function and can be returned cheaply. i.e. return( catInTheHat %>% gsub('Thing1', 'Thing2', .) %>% gsub('Red Fish', 'Blue Fish', .) ) ...and so on.

gsub("\\s", "", chartr("' ", " _", x)) # Use whitespace and then remove it

Try this replace multi text character in column: df$TYPE <- str_replace_all(df$TYPE, c("test" = "new_test", "G" = "N", "T" = "W"))

Related

How can I get the output of this function to print onto different lines in R?

Solve shorten notation by regular expression

Iterating over words across vector of strings and applying change to single word

String Dataframe R [duplicate]

How do keep only unique words within each string in a vector

Categories

Resources