How to move patterns in a string in r? - r

I am trying to write a function that moves a given pattern within a string in R. For example, if my strings are pattern_string1, pattern_string2, pattern_string3, pattern_string4, I want to change them to string1_pattern, string2_pattern, string3_pattern, string4_pattern.
In order to achieve this, I tried the following:
string_flip <- function(x, pattern){
if(str_detect(x, pattern)==TRUE){
str_remove(x, pattern) %>%
paste(x, "pattern", sep = "_")
}
}
However, when I try to apply this onto a vector of strings by the following code:
stringvector <- c("pattern_string1", "pattern_string2", "pattern_string3", "pattern_string4", "string5", "string6")
string_flip(stringvector, "pattern")
it returns a warning and changes all elements, not only the ones that contain "pattern". In addition, it does not just append pattern to the end of the string; it also duplicates the string itself, so I get the following result:
[1] "_string1_pattern_string1_pattern" "_string2_pattern_string2_pattern" "_string3_pattern_string3_pattern"
[4] "_string4_pattern_string4_pattern" "string5_string5_pattern" "string6_string6_pattern"
Can anybody help me with this?
Thanks a lot in advance!

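Your function string_flip is not vectorised: if() only looks at a single condition, so the function works for only one string at a time.
The string is doubled because x is passed to paste a second time; the pipe already supplies the result of str_remove as the first argument.
In paste, pattern should not be in quotes, otherwise the literal text "pattern" is appended instead of the value of the argument.
Try this function: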
library(stringr)
string_flip <- function(x, pattern){
trimws(ifelse(str_detect(x, pattern),
str_remove(x, pattern) %>% paste(pattern, sep = "_"), x), whitespace = '_')
}
stringvector <- c('pattern_string1', 'pattern_string2', 'pattern_string3', 'pattern_string4')
string_flip(stringvector, "pattern")
#[1] "string1_pattern" "string2_pattern" "string3_pattern" "string4_pattern"
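A base R alternative, as a minimal sketch: capture the leading pattern and the rest of the string, then swap the two groups with sub. The name flip_prefix is hypothetical, and the sketch assumes the pattern sits at the start of the string, is followed by an underscore, and contains no regex metacharacters.

```r
# Hypothetical helper: move a leading "<pattern>_" to the end as "_<pattern>".
flip_prefix <- function(x, pattern) {
  # Group 1 is the pattern, group 2 is everything after the underscore.
  # Strings that do not start with "<pattern>_" are returned unchanged.
  sub(paste0("^(", pattern, ")_(.*)$"), "\\2_\\1", x)
}

stringvector <- c("pattern_string1", "pattern_string2", "string5")
flip_prefix(stringvector, "pattern")
#[1] "string1_pattern" "string2_pattern" "string5"
```

Because sub is already vectorised, no ifelse is needed, and elements without the pattern pass through untouched.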

Related

replacing the same pattern in a string with new value each time

I have one string containing 25 occurrences of the placeholder xy and a vector of length 25 whose elements should replace those 25 occurrences, in order.
This is not for programming or anything complicated; I just want a result that I can copy and paste into a forum that uses this BBCode inside the string to make a colorful line.
string <- "[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]__[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]"
colrs <- c( "08070D", "100F1A", "191627", "211E34", "292541", "312D4E", "39345B", "413C68", "4A4375", "524A82", "5A528E", "62599B", "6C64A6", "7971AE", "827BB3", "8C85B9", "958FBF", "9F99C4", "A9A3CA", "B2AED0", "BCB8D6", "C5C2DC", "CFCCE2", "D9D6E8", "E2E0ED")
and want this as a result
[COLOR="#08070D"]_[COLOR="#100F1A"]_[COLOR="#191627"]_[COLOR="#211E34"]_[COLOR="#292541"]_[COLOR="#312D4E"]_[COLOR="#39345B"]_[COLOR="#413C68"]_[COLOR="#4A4375"]_[COLOR="#524A82"]_[COLOR="#5A528E"]_[COLOR="#62599B"]_[COLOR="#6C64A6"]_[COLOR="#7971AE"]_[COLOR="#827BB3"]_[COLOR="#8C85B9"]_[COLOR="#958FBF"]_[COLOR="#9F99C4"]_[COLOR="#A9A3CA"]_[COLOR="#B2AED0"]_[COLOR="#BCB8D6"]_[COLOR="#C5C2DC"]_[COLOR="#CFCCE2"]_[COLOR="#D9D6E8"]_[COLOR="#E2E0ED"]__[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]
I have completely revised my answer given that you removed the previous iteration of your code example. Here's the revised solution:
string <- '[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]__[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]'
colrs <- c("08070D", "100F1A", "191627", "211E34", "292541", "312D4E", "39345B", "413C68", "4A4375", "524A82", "5A528E", "62599B", "6C64A6", "7971AE", "827BB3", "8C85B9", "958FBF", "9F99C4", "A9A3CA", "B2AED0", "BCB8D6", "C5C2DC", "CFCCE2", "D9D6E8", "E2E0ED")
library(stringr)
string0 <- string |>
str_split("xy") |>
unlist()
string0[seq_along(colrs)] |>
str_c(colrs, collapse = "") |>
str_c(string0[length(colrs)+1])
[1] "[COLOR=\"#08070D\"]_[COLOR=\"#100F1A\"]_[COLOR=\"#191627\"]_[COLOR=\"#211E34\"]_[COLOR=\"#292541\"]_[COLOR=\"#312D4E\"]_[COLOR=\"#39345B\"]_[COLOR=\"#413C68\"]_[COLOR=\"#4A4375\"]_[COLOR=\"#524A82\"]_[COLOR=\"#5A528E\"]_[COLOR=\"#62599B\"]_[COLOR=\"#6C64A6\"]_[COLOR=\"#7971AE\"]_[COLOR=\"#827BB3\"]_[COLOR=\"#8C85B9\"]_[COLOR=\"#958FBF\"]_[COLOR=\"#9F99C4\"]_[COLOR=\"#A9A3CA\"]_[COLOR=\"#B2AED0\"]_[COLOR=\"#BCB8D6\"]_[COLOR=\"#C5C2DC\"]_[COLOR=\"#CFCCE2\"]_[COLOR=\"#D9D6E8\"]_[COLOR=\"#E2E0ED\"]__[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]"
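The same positional replacement can be sketched in base R with gregexpr and the replacement form of regmatches. The template below is a shortened, three-colour stand-in for the full 25-colour string, used only for illustration; the approach assumes the number of xy placeholders equals the length of colrs.

```r
# Shortened stand-in for the full string (3 placeholders instead of 25).
template <- '[COLOR="#xy"]_[COLOR="#xy"]_[COLOR="#xy"]__[/color]_[/color]_[/color]'
colrs <- c("08070D", "100F1A", "191627")

# Locate every literal "xy" in the string.
m <- gregexpr("xy", template, fixed = TRUE)

# Assign one replacement per match, in order. For gregexpr match data the
# value is a list with one character vector per input string.
regmatches(template, m) <- list(colrs)
template
```

This keeps the double quotes and the double underscore intact, since only the matched xy substrings are touched.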
EDIT #2:
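An easy solution to the new data problem is the following. Note that it drops the double quotes, turns the double underscore into a single one, and relies on colrs being recycled across the 50 split pieces, which newer stringr versions (1.5.0 and later, with stricter recycling rules) may reject: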
library(stringr)
string0 <- unlist(str_split(gsub('"', "", string), "__?"))
str_c(str_replace(string0,'xy', colrs), collapse = "_")
[1] "[COLOR=#08070D]_[COLOR=#100F1A]_[COLOR=#191627]_[COLOR=#211E34]_[COLOR=#292541]_[COLOR=#312D4E]_[COLOR=#39345B]_[COLOR=#413C68]_[COLOR=#4A4375]_[COLOR=#524A82]_[COLOR=#5A528E]_[COLOR=#62599B]_[COLOR=#6C64A6]_[COLOR=#7971AE]_[COLOR=#827BB3]_[COLOR=#8C85B9]_[COLOR=#958FBF]_[COLOR=#9F99C4]_[COLOR=#A9A3CA]_[COLOR=#B2AED0]_[COLOR=#BCB8D6]_[COLOR=#C5C2DC]_[COLOR=#CFCCE2]_[COLOR=#D9D6E8]_[COLOR=#E2E0ED]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]_[/color]"
EDIT:
Given this data:
string <- "AxyBxyCxyDxyExy"
vector <- c(1,2,3,4,5)
and this desired result:
"A1B2C3D4E5"
you can do this:
library(stringr)
First, we extract the character that's before xy using str_extract_all:
string0 <- unlist(str_extract_all(string, ".(?=xy)"))
Next we do two things: a) we replace the lone character with itself (\\1) AND the vector value, and b) we collapse the separate strings into one large string using str_c:
str_c(str_replace(string0, "(.)$", str_c("\\1", vector)), collapse = "")
[1] "A1B2C3D4E5"
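For this simpler data, a base R sketch is to split on the placeholder and paste the pieces back together with the replacement values. The variable name values is used here instead of vector to avoid shadowing the base function; the sketch assumes there are as many placeholders as values.

```r
string <- "AxyBxyCxyDxyExy"
values <- c(1, 2, 3, 4, 5)

# Split on the literal "xy"; strsplit drops the empty piece after the last match.
pieces <- strsplit(string, "xy", fixed = TRUE)[[1]]

# Pair each piece with its value and collapse into one string.
result <- paste0(pieces, values, collapse = "")
result
#[1] "A1B2C3D4E5"
```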

combining words in tm R is not achieving desired result

I am trying to combine a few words so that they count as one.
In this example I want val and valuati to be counted as valuation.
The code I have been using to try and do this is below:
#load in package
library(tm)
replaceWords <- function(x, from, keep){
regex_pat <- paste(from, collapse = "|")
gsub(regex_pat, keep, x)
}
oldwords <- c("val", "valuati")
newword <- c("valuation")
TextDoc2 <- tm_map(TextDoc, replaceWords, from=oldwords, keep=newword)
However, this does not work as expected. Any time val appears inside a word, it is replaced with valuation. For example, equivalent becomes equivaluation. How do I get around this and achieve my desired result?
Try this function -
replaceWords <- function(x, from, keep){
regex_pat <- sprintf('\\b(%s)\\b', paste(from, collapse = '|'))
gsub(regex_pat, keep, x)
}
val matches within equivalent. Adding word boundaries (\\b) stops that from happening.
grepl('val', 'equivalent')
#[1] TRUE
grepl('\\bval\\b', 'equivalent')
#[1] FALSE
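A quick check of the word-boundary version on a plain character vector, as a sketch. The sample sentences are made up for illustration. With tm, a custom function like this is usually wrapped in content_transformer() before being passed to tm_map(); that step is left out here so the example runs in base R.

```r
replaceWords <- function(x, from, keep) {
  # \\b on both sides means only whole words are replaced.
  regex_pat <- sprintf("\\b(%s)\\b", paste(from, collapse = "|"))
  gsub(regex_pat, keep, x)
}

# Made-up sample text.
txt <- c("val of the firm", "equivalent val", "valuati report")
replaceWords(txt, from = c("val", "valuati"), keep = "valuation")
#[1] "valuation of the firm" "equivalent valuation"  "valuation report"
```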

Using an anonymous function in mutate

I want to use the character strings from one column of a dataframe as the search string in a sub search of the character strings in another column of the dataframe on a row-by-row basis. I would like to do this using dplyr::mutate. I have figured out a way to do this using an anonymous function and apply, but I feel like apply shouldn't be necessary and I must be doing something wrong with how I'm implementing mutate. (And yes, I know that tools::file_path_sans_ext can give me the final result without needing to use mutate; I just want to understand how to use mutate.)
Here is the code that I think should work but doesn't:
files.vec <- dir(
dir.target,
full.names = T,
recursive = T,
include.dirs = F,
no.. = T
)
library(tools)
files.paths.df <- as.data.frame(
cbind(
path = files.vec,
directory = dirname(files.vec),
file = basename(files.vec),
extension = file_ext(files.vec)
)
)
library(tidyr)
library(dplyr)
files.split.df <- files.paths.df %>%
mutate(
no.ext = function(x) {
sub(paste0(".", x["extension"], "$"), "", x["file"])
}
)
| Error in mutate_impl(.data, dots) :
| Column `no.ext` is of unsupported type function
Here is the code that works, using apply:
files.split.df <- files.paths.df %>%
mutate(no.ext = apply(., 1, function(x) {
sub(paste0(".", x["extension"], "$"), "", x["file"])
}))
Can this be done without apply?
Apparently what you need is a whole bunch of parentheses. See https://stackoverflow.com/a/36906989/3277050
In your situation it looks like:
files.split.df <- files.paths.df %>%
mutate(
no.ext = (function(x) {sub(paste0(".", x["extension"], "$"), "", x["file"])})(.)
)
So it seems that if you wrap the whole function definition in parentheses, you can then treat it like a regular function and supply arguments to it.
New Answer
Really, though, this is not the right way to use mutate at all. I focused on the anonymous function part first without looking at what you are actually doing. What you need is a vectorized version of sub, so I used str_replace from the stringr package. Then you can just refer to columns by name, because that is the beauty of dplyr:
library(tidyr)
library(dplyr)
library(stringr)
files.split.df <- files.paths.df %>%
mutate(
no.ext = str_replace(file, paste0(".", extension, "$"), ""))
Edit to Answer Comment
To use a user defined function where there isn't an existing vectorized function you could use Vectorize like this:
string_fun <- Vectorize(function(x, y) {sub(paste0(".", x, "$"), "", y)})
files.split.df <- files.paths.df %>%
mutate(
no.ext = string_fun(extension, file))
Or if you really don't want to name the function, which I do not recommend as it is much harder to read:
files.split.df <- files.paths.df %>%
mutate(
no.ext = (Vectorize(function(x, y) {sub(paste0(".", x, "$"), "", y)}))(extension, file))
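To see the Vectorize idea in isolation, here is a base R sketch on a made-up data frame. The name strip_ext and the sample file names are hypothetical. Unlike the code above, the dot is escaped as "\\." so that it matches a literal dot; with an unescaped "." and an empty extension, the pattern would become ".$" and remove the last character of the file name.

```r
# Hypothetical helper: sub() is not vectorised over its pattern argument,
# so Vectorize() applies it once per (extension, file) pair.
strip_ext <- Vectorize(function(ext, file) {
  sub(paste0("\\.", ext, "$"), "", file)
}, USE.NAMES = FALSE)

# Made-up sample data.
files <- data.frame(
  file = c("a.csv", "b.tar.gz", "README"),
  extension = c("csv", "gz", ""),
  stringsAsFactors = FALSE
)

files$no.ext <- strip_ext(files$extension, files$file)
files$no.ext
#[1] "a"      "b.tar"  "README"
```

The same function can be used inside mutate as no.ext = strip_ext(extension, file).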

Assigning new strings with conditional match

I have a problem replacing strings with new ones conditionally.
Below is a short version of my real problem. So far it works, but I need a better solution since the real data has many rows.
strings <- c("ca_A33","cb_A32","cc_A31","cd_A30")
Basically, I want to replace strings with replace_strings: the first item in strings is replaced with the first item in replace_strings, and so on.
replace_strings <- c("A1","A2","A3","A4")
So the final string should look like
final_string <- c("ca_A1","cb_A2","cc_A3","cd_A4")
I wrote a simple function, assign_new:
assign_new <- function(x){
ifelse(grepl("A33",x),gsub("A33","A1",x),
ifelse(grepl("A32",x),gsub("A32","A2",x),
ifelse(grepl("A31",x),gsub("A31","A3",x),
ifelse(grepl("A30",x),gsub("A30","A4",x),x))))
}
assign_new(strings)
[1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"
OK, it seems we have a solution. But let's say I have A1000 to A1 and want to replace them with A1 to A1000; I would need 1000 lines of ifelse statements. How can we tackle that?
If your vectors are ordered to be matched, then you can use:
> paste0(gsub("(.*_)(.*)","\\1", strings ), replace_strings)
[1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"
You can use regmatches. First obtain all the characters that follow _ using regexpr, then replace them as shown below:
`regmatches<-`(strings,regexpr("(?<=_).*",strings,perl = T),value=replace_strings)
[1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"
Not the fastest, but very tractable and easy to maintain:
for (i in seq_along(strings)) {
strings[i] <- gsub("\\d+$", i, strings[i])
}
"\\d+$" just matches any number at the end of the string.
EDIT: Per @Onyambu's comment, removing map2_chr as paste is a vectorized function.
foo <- function(x, y){
x <- unlist(lapply(strsplit(x, "_"), '[', 1))
paste(x, y, sep = "_")
}
foo(strings, replace_strings)
with x being strings and y being replace_strings. You first split the strings object at the _ character, and paste with the respective replace_strings object.
EDIT:
For objects where there is no positional relationship you could create a reference table (dataframe, list, etc.) and match your values.
reference_tbl <- data.frame(strings, replace_strings)
foo <- function(x){
y <- reference_tbl$replace_strings[match(x, reference_tbl$strings)]
x <- unlist(lapply(strsplit(x, "_"), '[', 1))
paste(x, y, sep = "_")
}
foo(strings)
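When the mapping is keyed on the code after the underscore rather than on position, a named vector can serve as the lookup table. This is a sketch under that assumption; the names lookup, prefix, code and new_code are hypothetical, and the extra element "ce_A99" is added only to show that unknown codes are left alone.

```r
# Hypothetical lookup table: names are old codes, values are new codes.
lookup <- c(A33 = "A1", A32 = "A2", A31 = "A3", A30 = "A4")
strings <- c("ca_A33", "cb_A32", "cc_A31", "cd_A30", "ce_A99")

prefix <- sub("_.*$", "", strings)      # part before the first underscore
code   <- sub("^[^_]*_", "", strings)   # part after the first underscore

# Replace codes found in the lookup; keep the original code otherwise.
new_code <- ifelse(code %in% names(lookup), lookup[code], code)
result <- paste(prefix, new_code, sep = "_")
result
#[1] "ca_A1"  "cb_A2"  "cc_A3"  "cd_A4"  "ce_A99"
```

For A1000 down to A1, the lookup could be built programmatically, for example with setNames(paste0("A", 1:1000), paste0("A", 1000:1)).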
Using the dplyr package:
strings <- c("ca_A33","cb_A32","cc_A31","cd_A30")
replace_strings <- c("A1","A2","A3","A4")
df <- data.frame(strings, replace_strings)
df <- mutate(rowwise(df),
strings = gsub("_.*",
paste0("_", replace_strings),
strings)
)
df <- select(df, strings)
Output:
# A tibble: 4 x 1
strings
<chr>
1 ca_A1
2 cb_A2
3 cc_A3
4 cd_A4
Yet another way:
mapply(function(x,y) gsub("(\\w\\w_).*",paste0("\\1",y),x),strings,replace_strings,USE.NAMES=FALSE)
# [1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"

Regular Expression: replace the n-th occurence

Does someone know how to find the n-th occurrence of a string within an expression and how to replace it using a regular expression?
For example, I have the following string
txt <- "aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa"
and I want to replace the 5th occurrence of '-' with '|'
and the 7th occurrence of '-' with "||", like
[1] aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa
How do I do this?
Thanks,
Florian
(1) sub. It can be done in a single regular expression with sub:
> sub("(^(.*?-){4}.*?)-(.*?-.*?)-", "\\1|\\3||", txt, perl = TRUE)
[1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"
(2) sub twice. Alternatively, this variation calls sub twice:
> txt2 <- sub("(^(.*?-){6}.*?)-", "\\1|", txt, perl = TRUE)
> sub("(^(.*?-){4}.*?)-", "\\1||", txt2, perl = TRUE)
[1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"
(3) sub.fun. This variation creates a function sub.fun that performs one substitution. It uses fn$ from the gsubfn package to interpolate n-1, pat, and value into the arguments of sub. First define the function and then call it twice.
library(gsubfn)
sub.fun <- function(x, pat, n, value) {
fn$sub( "(^(.*?-){`n-1`}.*?)$pat", "\\1$value", x, perl = TRUE)
}
> sub.fun(sub.fun(txt, "-", 7, "||"), "-", 5, "|")
[1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"
(We could have modified the arguments to sub in the body of sub.fun using paste or sprintf to give a base R solution but at the expense of some additional verbosity.)
This can be reformulated as a replacement function giving this pleasing sequence:
"sub.fun<-" <- sub.fun
tt <- txt # make a copy so that we preserve the input txt
sub.fun(tt, "-", 7) <- "||"
sub.fun(tt, "-", 5) <- "|"
> tt
[1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"
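The base R variant mentioned in (3) can be sketched with sprintf. The name sub_nth is hypothetical, and the sketch assumes that pat contains no regex metacharacters and that value contains no backslashes.

```r
# Hypothetical base R counterpart of sub.fun: replace the n-th match of pat.
sub_nth <- function(x, pat, n, value) {
  # Skip n-1 occurrences lazily, capture everything before the n-th one,
  # then put the captured prefix back followed by the replacement.
  regex <- sprintf("(^(.*?%s){%d}.*?)%s", pat, n - 1, pat)
  sub(regex, paste0("\\1", value), x, perl = TRUE)
}

txt <- "aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa"
# Replace the 7th occurrence first so the 5th keeps its position.
sub_nth(sub_nth(txt, "-", 7, "||"), "-", 5, "|")
#[1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"
```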
(4) gsubfn. Using gsubfn from the gsubfn package, we can use a particularly simple regular expression (it's just "-") and the code has quite a straightforward structure. We perform the substitution via a proto method. The proto object containing the method is passed in place of a replacement string. The simplicity of this approach derives from the fact that gsubfn automatically makes a count variable available to such methods:
library(gsubfn) # gsubfn also pulls in proto
p <- proto(fun = function(this, x) {
if (count == 5) return("|")
if (count == 7) return("||")
x
})
> gsubfn("-", p, txt)
[1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"
UPDATE: Some corrections.
UPDATE 2: Added a replacement function approach to (3).
UPDATE 3: Added pat argument to sub.fun.
An alternative is Hadley's stringr package, on which the following function is built:
require(stringr)
replace.nth <- function(string, pattern, replacement, n) {
locations <- str_locate_all(string, pattern)
str_sub(string, locations[[1]][n, 1], locations[[1]][n, 2]) <- replacement
string
}
txt <- "aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa"
# Replace the later occurrence first so that the earlier positions are unaffected
txt.new <- replace.nth(txt, "-", "||", 7)
txt.new <- replace.nth(txt.new, "-", "|", 5)
txt.new
# [1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"
One way to do this is to use gregexpr to find the positions of the -:
posns <- gregexpr("-",txt)[[1]]
And then pasting together the relevant pieces and separators:
paste0(substr(txt,1,posns[5]-1),"|",substr(txt,posns[5]+1,posns[7]-1),"||",substr(txt,posns[7]+1,nchar(txt)))
[1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"
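The gregexpr idea can be generalised with the replacement form of regmatches, which avoids the manual substr arithmetic. This is a sketch; the function name replace_occurrences and its interface (a character vector whose names are the occurrence numbers) are assumptions made for illustration, and pat is treated as a literal string.

```r
# Hypothetical helper: replace selected occurrences of a literal pattern.
# `replacements` is a character vector named by occurrence index.
replace_occurrences <- function(x, pat, replacements) {
  m <- gregexpr(pat, x, fixed = TRUE)
  found <- regmatches(x, m)[[1]]          # all matches, in order
  idx <- as.integer(names(replacements))
  stopifnot(all(idx <= length(found)))    # requested occurrences must exist
  found[idx] <- replacements              # overwrite only the chosen ones
  regmatches(x, m) <- list(found)         # write all matches back
  x
}

txt <- "aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa"
replace_occurrences(txt, "-", c("5" = "|", "7" = "||"))
#[1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"
```

Because all occurrences are located before any replacement happens, the indices always refer to positions in the original string.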
