Replace multiple strings in one gsub() or chartr() statement in R? - r

I have a string variable containing alphabet[a-z], space[ ], and apostrophe['],eg. x <- "a'b c"
I want to replace apostrophe['] with blank[], and replace space[ ] with underscore[_].
x <- gsub("'", "", x)
x <- gsub(" ", "_", x)
It works absolutely, but when I have a lot of condition, the code becomes ugly. Therefore, I want to use chartr(), but chartr() can't deal with blank, eg.
x <- chartr("' ", "_", x)
#Error in chartr("' ", "_", "a'b c") : 'old' is longer than 'new'
Is there any way to solve this problem? thanks!

You can use gsubfn
library(gsubfn)
gsubfn(".", list("'" = "", " " = "_"), x)
# [1] "ab_c"
Similarly, we can also use mgsub which allows multiple replacement with multiple pattern to search
mgsub::mgsub(x, c("'", " "), c("", "_"))
#[1] "ab_c"

I am a fan of the syntax that the %<>% and %>% opperators from the magrittr package provide.
library(magrittr)
x <- "a'b c"
x %<>%
gsub("'", "", .) %>%
gsub(" ", "_", .)
x
##[1] "ab_c"
gusbfn is wonderful, but I like the chaining %>% allows.

I'd go with the quite fast function stri_replace_all_fixed from library(stringi):
library(stringi)
stri_replace_all_fixed("a'b c", pattern = c("'", " "), replacement = c("", "_"), vectorize_all = FALSE)
Here is a benchmark taking into account most of the other suggested solutions:
library(stringi)
library(microbenchmark)
library(gsubfn)
library(mgsub)
library(magrittr)
library(dplyr)
x_gsubfn <-
x_mgsub <-
x_nested_gsub <-
x_magrittr <-
x_stringi <- "a'b c"
microbenchmark("gsubfn" = { gsubfn(".", list("'" = "", " " = "_"), x_gsubfn) },
"mgsub" = { mgsub::mgsub(x_mgsub, c("'", " "), c("", "_")) },
"nested_gsub" = { gsub("Find", "Replace", gsub("Find","Replace", x_nested_gsub)) },
"magrittr" = { x_magrittr %<>% gsub("'", "", .) %>% gsub(" ", "_", .) },
"stringi" = { stri_replace_all_fixed(x_stringi, pattern = c("'", " "), replacement = c("", "_"), vectorize_all = FALSE) }
)
Unit: microseconds
expr min lq mean median uq max neval
gsubfn 458.217 482.3130 519.12820 513.3215 538.0100 715.371 100
mgsub 180.521 200.8650 221.20423 216.0730 231.6755 460.587 100
nested_gsub 14.615 15.9980 17.92178 17.7760 18.7630 40.687 100
magrittr 113.765 133.7125 148.48202 142.9950 153.0680 296.261 100
stringi 3.950 7.7030 8.41780 8.2960 9.0860 26.071 100

I know it is a bit old but it is hard to pass on an efficient base R solution. Just use the pipe:
test <- "abcegdfk461mnb"
test2 <- gsub("e|4|6","",test)
print(test2)

I think nested gsub will do the job.
gsub("Find","Replace",gsub("Find","Replace",X))

I would opt for a magrittr and/or dplyr solution, as well. However, I prefer not making a new copy of the object, especially if it is in a function and can be returned cheaply.
i.e.
return(
catInTheHat %>% gsub('Thing1', 'Thing2', .) %>% gsub('Red Fish', 'Blue
Fish', .)
)
...and so on.

gsub("\\s", "", chartr("' ", " _", x)) # Use whitespace and then remove it

Try this replace multi text character in column:
df$TYPE <- str_replace_all(df$TYPE, c("test" = "new_test", "G" = "N", "T" = "W"))

I use this function, which also allows omitting the argument for the replacement if the replacement is empty:
s=function(x,...,ignore.case=F,perl=F,fixed=F,useBytes=F){
a=match.call(expand.dots=F)$...
l=length(a)
for(i in seq(1,l,2))x=gsub(a[[i]],if(i==l)""else a[[i+1]],x,ignore.case=ignore.case,perl=perl,fixed=fixed,useBytes=useBytes)
x
}
> s("aa bb cc","aa","dd","bb")
[1] "dd cc"

Related

How can I get the output of this function to print onto different lines in R?

So, I am writing a function that, among many other things, is supposed to keep only the first sentence from each paragraph of a text and preserve the paragraph structure (i.e. each sentence is in its own line). Here is the code that I have so far:
text_shortener <- function(input_text) {
lapply(input_text, function(x)str_split(x, "\\.", simplify = T)[1])
first.sentences <- unlist(lapply(input_text, function(x)str_split(x, "\\.", simplify = T)[1]))
no.spaces <- gsub(pattern = "(?<=[\\s])\\s*|^\\s+|\\s+$", replacement = "", x = first.sentences, perl = TRUE)
stopwords <- c("the", "really", "truly", "very", "The", "Really", "Truly", "Very")
x <- unlist(strsplit(no.spaces, " "))
no.stopwords <- paste(x[!x %in% stopwords], collapse = " ")
final.text <- gsub(pattern = "(?<=\\w{5})\\w+", replacement = ".", x = no.stopwords, perl=TRUE)
return(final.text)
}
All of the functions are working as they should, but the one part I can't figure out is how to get the output to print onto separate lines. When I run the function with a vector of text (I was using some text from Moby Dick as a test), this is what I get:
> text_shortener(Moby_Dick)
[1] "Call me Ishma. It is a way I have of drivi. off splee., and regul. circu. This is my subst. for pisto. and ball"
What I want is for the output of this function to look like this:
[1] "Call me Ishma."
[2] "It is a way I have of drivi. off splee., and regul. circu."
[3] "This is my subst. for pisto. and ball"
I am relatively new to R and this giving me a real headache, so any help would be much appreciated! Thank you!
Looking at your output, it seems like splitting on a period followed by a capital letter if what you need.
You could accomplish that with strsplit() and split the string up like so:
strsplit("Call me Ishma. It is drivi. off splee., and regul. circu. This is my subst. for pisto.","\\. (?=[A-Z])", perl=T)
That finds instances where a period is followed by a space and a capital letter and splits the character up there.
Edit: You could add it to the end of your function like so:
text_shortener <- function(input_text) {
lapply(input_text, function(x)str_split(x, "\\.", simplify = T)[1])
first.sentences <- unlist(lapply(input_text, function(x)str_split(x, "\\.", simplify = T)[1]))
no.spaces <- gsub(pattern = "(?<=[\\s])\\s*|^\\s+|\\s+$", replacement = "", x = first.sentences, perl = TRUE)
stopwords <- c("the", "really", "truly", "very", "The", "Really", "Truly", "Very")
x <- unlist(strsplit(no.spaces, " "))
no.stopwords <- paste(x[!x %in% stopwords], collapse = " ")
trim.text <- gsub(pattern = "(?<=\\w{5})\\w+", replacement = ".", x = no.stopwords, perl=TRUE)
final.text <- strsplit(trim.text, "\\. (?=[A-Z])", perl=T)
return(final.text)
}

Solve shorten notation by regular expression

I want to solve two shorten notation in R.
For Ade/i, I should get Ade, Adi
For Do(i)lfal, I should get Dolfal, Doilfal
I have this solution
b='Do(i)lferl'
gsub(pattern = '(\\w+)\\((\\w+)+\\)', replacement='\\1\\i,\\1\\2', x=b)
Can anyone help me to code this
If these values are part of a dataframe, you can do this:
df <- data.frame(
Nickname = c("Ade/i", "Do(i)lfal")
)
df$Nickname_new[1] <- paste0(sub("(?=.*/)(.*)/.*", "\\1", df$Nickname[1], perl = T), ",", paste0(unlist(str_split(df$Nickname[1], "\\w/")), collapse = ""))
df$Nickname_new[2] <- paste0(sub("(.*)(\\(.*\\))(.*)", "\\1\\3", df$Nickname[2]),",", sub("(.*)\\((\\w)\\)(.*)", "\\1\\2\\3\\4", df$Nickname[2]))
which gives you:
df
Nickname Nickname_new
1 Ade/i Ade,Adi
2 Do(i)lfal Dolfal,Doilfal
EDIT:
Just in case the whole thing is not part of a dataframe but an atomic vector, you can do this:
x <- c("Ade/i", "Do(i)lfal")
c(paste0(sub("/.*", "", x[grepl("/", x)]), ", ", sub("./", "", x[grepl("/", x)])),
paste0(sub("(.*)\\((\\w)\\)(.*)", "\\1\\2\\3\\4", x[grepl("\\(", x)]), ", ", sub("\\(\\w\\)", "", x[grepl("\\(", x)])))
which gives you:
[1] "Ade, Adi" "Doilfal, Dolfal"
If there are values that you don't want to change, then this regex by #Wiktor will work (it works even with any scenario!):
x <- c("Ade/i", "Do(i)lfal", "Peter", "Mary")
gsub('(\\w*)\\((\\w+)\\)(\\w*)', '\\1\\2\\3, \\1\\3', gsub("(\\w*)(\\w)/(\\w)\\b", "\\1\\2, \\1\\3", x))
which gives you:
[1] "Ade, Adi" "Doilfal, Dolfal" "Peter" "Mary"

Iterating over words across vector of strings and applying change to single word

Given string:
words <- c("fauuucet water", "tap water")
I would like to apply toupper function to all words that contain u.
Desired results
res <- c("FAUUCET water", "tap water")
Function
change_u_case <- function(str) {
sapply(
X = str,
FUN = function(search_term) {
sapply(
X = strsplit(search_term, split = "\\s", perl = TRUE),
FUN = function(word) {
if (grepl(pattern = "u", x = word)) {
toupper(word)
}
}
,
USE.NAMES = FALSE
)
},
USE.NAMES = FALSE
)
}
Tests
change_u_case(words) -> tst_res
words
tst_res
unlist(tst_res)
Notes
In particular, I'm interested whether solution using single rapply call could be build
rlist::list.iter approach would be also interesting
The selection of words containing u character is an example, in practice I would be looking to apply various conditions reflecting length and so on
You can use a single sapply call, i.e.
sapply(strsplit(words, ' '), function(i) {i1 <- grepl('u', i);
i[i1] <- toupper(i[i1]);
paste0(i, collapse = ' ')
})
#[1] "FAUUUCET water" "tap water"
Here is a stringi-based solution:
library(stringi);
sapply(stri_extract_all_words(words),
function(w) paste(ifelse(stri_detect(w, regex = "u"), toupper(w), w), collapse = " "))
#[1] "FAUUUCET water" "tap water"
Try stringr:
str_replace_all(words, '\\w*u\\w*', toupper)
# [1] "FAUUUCET water" "tap water"
More examples:
str_replace_all(c('Upset', 'day day up'), '\\w*u\\w*', toupper)
# [1] "Upset" "day day UP"

String Dataframe R [duplicate]

I have data that looks like this:
vector = c("hello I like to code hello","Coding is fun", "fun fun fun")
I want to remove duplicate words (space delimited) i.e. the output should look like
vector_cleaned
[1] "hello I like to code"
[2] "coding is fun"
[3] "fun"
Split it up (strsplit on spaces), use unique (in lapply), and paste it back together:
vapply(lapply(strsplit(vector, " "), unique), paste, character(1L), collapse = " ")
# [1] "hello i like to code" "coding is fun" "fun"
## OR
vapply(strsplit(vector, " "), function(x) paste(unique(x), collapse = " "), character(1L))
Update based on comments
You can always write a custom function to use with your vapply function. For instance, here's a function that takes a split string, drops strings that are shorter than a certain number of characters, and has the "unique" setting as a user choice.
myFun <- function(x, minLen = 3, onlyUnique = TRUE) {
a <- if (isTRUE(onlyUnique)) unique(x) else x
paste(a[nchar(a) > minLen], collapse = " ")
}
Compare the output of the following to see how it would work.
vapply(strsplit(vector, " "), myFun, character(1L))
vapply(strsplit(vector, " "), myFun, character(1L), onlyUnique = FALSE)
vapply(strsplit(vector, " "), myFun, character(1L), minLen = 0)
I spent a while looking for a data frame, tidyverse-friendly version of this, so figured I'd paste my verbose solution:
library(tidyverse)
df <- data.frame(vector = c("hello I like to code hello",
"Coding is fun",
"fun fun fun"))
df %>%
mutate(split = str_split(vector, " ")) %>% # split
mutate(split = map(.$split, ~ unique(.x))) %>% # drop duplicates
mutate(split = map_chr(.$split, ~paste(.x, collapse = " "))) # recombine
Result:
#> vector split
#> 1 hello I like to code hello hello I like to code
#> 2 Coding is fun Coding is fun
#> 3 fun fun fun fun
Created on 2021-01-03 by the reprex package (v0.3.0)
Using tidyverse
library(dplyr)
library(stringr)
library(tidyr)
df %>%
mutate(rn = row_number()) %>%
separate_longer_delim(vector, delim = regex("\\s+")) %>%
distinct() %>%
reframe(vector = str_c(vector, collapse = " "), .by = c("rn")) %>%
select(-rn)
-output
vector
1 hello I like to code
2 Coding is fun
3 fun

How do keep only unique words within each string in a vector

I have data that looks like this:
vector = c("hello I like to code hello","Coding is fun", "fun fun fun")
I want to remove duplicate words (space delimited) i.e. the output should look like
vector_cleaned
[1] "hello I like to code"
[2] "coding is fun"
[3] "fun"
Split it up (strsplit on spaces), use unique (in lapply), and paste it back together:
vapply(lapply(strsplit(vector, " "), unique), paste, character(1L), collapse = " ")
# [1] "hello i like to code" "coding is fun" "fun"
## OR
vapply(strsplit(vector, " "), function(x) paste(unique(x), collapse = " "), character(1L))
Update based on comments
You can always write a custom function to use with your vapply function. For instance, here's a function that takes a split string, drops strings that are shorter than a certain number of characters, and has the "unique" setting as a user choice.
myFun <- function(x, minLen = 3, onlyUnique = TRUE) {
a <- if (isTRUE(onlyUnique)) unique(x) else x
paste(a[nchar(a) > minLen], collapse = " ")
}
Compare the output of the following to see how it would work.
vapply(strsplit(vector, " "), myFun, character(1L))
vapply(strsplit(vector, " "), myFun, character(1L), onlyUnique = FALSE)
vapply(strsplit(vector, " "), myFun, character(1L), minLen = 0)
I spent a while looking for a data frame, tidyverse-friendly version of this, so figured I'd paste my verbose solution:
library(tidyverse)
df <- data.frame(vector = c("hello I like to code hello",
"Coding is fun",
"fun fun fun"))
df %>%
mutate(split = str_split(vector, " ")) %>% # split
mutate(split = map(.$split, ~ unique(.x))) %>% # drop duplicates
mutate(split = map_chr(.$split, ~paste(.x, collapse = " "))) # recombine
Result:
#> vector split
#> 1 hello I like to code hello hello I like to code
#> 2 Coding is fun Coding is fun
#> 3 fun fun fun fun
Created on 2021-01-03 by the reprex package (v0.3.0)
Using tidyverse
library(dplyr)
library(stringr)
library(tidyr)
df %>%
mutate(rn = row_number()) %>%
separate_longer_delim(vector, delim = regex("\\s+")) %>%
distinct() %>%
reframe(vector = str_c(vector, collapse = " "), .by = c("rn")) %>%
select(-rn)
-output
vector
1 hello I like to code
2 Coding is fun
3 fun

Resources