I have a data.frame with ids composed of sequences of alphanumeric characters (e.g., id = c("A001", "A002", "B013")). I was looking for a function in stringr or stringi that would easily do math with these strings (id + 1 should return c("A002", "A003", "B014")).
I made a custom function that does the trick, however I have a feeling that there must be a better/more efficient/within package way to achieve this.
## requires dplyr, tidyr and stringr
str_add_n <- function(df, string, n, width = 3){
  string <- enquo(string)
  ## split the string into its text and numeric parts
  df <- df %>%
    separate(!!string,
             into = c("text", "num"),
             sep = "(?<=[A-Za-z])(?=[0-9])",
             remove = FALSE) %>%
    mutate(num = as.numeric(num),
           num = num + n,
           num = stringr::str_pad(as.character(num),
                                  width = width,
                                  side = "left",
                                  pad = "0")) %>%
    unite(next_string, text:num, sep = "")
  return(df)
}
Let's make a toy df
df <- data.frame(id = c("A001", "A002", "B013"))
str_add_n(df, id, 1)
id next_string
1 A001 A002
2 A002 A003
3 B013 B014
Again, this works, I'm wondering if there's a better way to do this, all tweaks welcome!
UPDATE
Based on the suggested answers I ran some benchmarking, and it appears that both come very close. I would be inclined to go with str_add_n_2 (I renamed the answer's function to be able to run both, and took the suggestion of adding x <- as.character(x)).
microbenchmark::microbenchmark(
  question   = str_add_n(df, id, 1),
  answer     = df %>% mutate_at(vars(id), funs(str_add_n_2(., 1))),
  string_add = df %>% mutate_at(vars(id), funs(string_add(as.character(.))))
)
Which yields
Unit: milliseconds
       expr      min       lq     mean   median       uq      max neval cld
   question 4.312094 4.448391 4.695276 4.570860 4.755748 10.29253   100   c
     answer 2.932146 3.017874 3.191262 3.117627 3.240688  8.24967   100   a
 string_add 3.388442 3.466466 3.699363 3.534416 3.682762  9.05441   100   b
More tweaks are welcome!
Here is a way with gsubfn:
id <- c("A001", "A002", "B013")
library(gsubfn)
gsubfn("([0-9]+)", function(x) sprintf("%03.0f", as.numeric(x) + 1), id)
#[1] "A002" "A003" "B014"
You could make it a function
string_add <- function(string, add = 1, width = 3) {
  gsubfn::gsubfn("([0-9]+)",
                 function(x) sprintf(paste0("%0", width, ".0f"), as.numeric(x) + add),
                 string)
}
string_add(id, add = 10, width = 5)
#"A00011" "A00012" "B00023"
I'd suggest it's easier to define the function on a vector of strings rather than hard-coding it to look for columns in the frame; for the latter, you can always use something like mutate_at(vars(id, ...), funs(str_add_n)).
str_add_n <- function(x, n = 1L) {
  gr <- gregexpr("\\d+", x)
  reg <- regmatches(x, gr)
  widths <- nchar(reg)
  regmatches(x, gr) <- sprintf(paste0("%0", widths, "d"), as.integer(reg) + n)
  x
}
vec <- c("A001", "A002", "B013")
str_add_n(vec)
# [1] "A002" "A003" "B014"
If in a frame:
df <- data.frame(id = c("A001", "A002", "B013"), x = 1:3,
                 stringsAsFactors = FALSE)
library(dplyr)
df %>%
  mutate_at(vars(id), funs(str_add_n(., 3)))
# id x
# 1 A004 1
# 2 A005 2
# 3 B016 3
Caveat: this silently requires a true character vector, not a factor ... a possible defensive tactic would be to add x <- as.character(x) to the function definition.
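For reference, a sketch of that defensive version, identical to the function above except for the added coercion on the first line:
str_add_n <- function(x, n = 1L) {
  x <- as.character(x) # defensively coerce factors to character
  gr <- gregexpr("\\d+", x)
  reg <- regmatches(x, gr)
  widths <- nchar(reg)
  regmatches(x, gr) <- sprintf(paste0("%0", widths, "d"), as.integer(reg) + n)
  x
}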
I am trying to apply the fm.Choquet function (Rfmtool package) to my R data frame, but with no success. The function works like this (ref. here):
# let x <- 0.6 (N = 1)
# and y <- c(0.3, 0.5). y always has 2^N elements (here, 2)
# env <- fm.Init(1). env is proportional to N
# fm.Choquet(0.6, c(0.3, 0.5), env) gives a single value output
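For concreteness, a minimal runnable version of that example (a sketch, assuming Rfmtool is installed):
library(Rfmtool)
env <- fm.Init(1)                 # env is proportional to N
fm.Choquet(0.6, c(0.3, 0.5), env) # returns a single numeric value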
I have this sample data frame:
set.seed(123456)
a <- qnorm(runif(30,min=pnorm(0),max=pnorm(1)))
b <- qnorm(runif(30,min=pnorm(0),max=pnorm(1)))
c <- qnorm(runif(30,min=pnorm(0),max=pnorm(1)))
df <- data.frame(a=a, b=b, c=c)
df$id <- seq_len(nrow(df))
I would like to apply fm.Choquet function to each row of my df such that, for each row (or ID), a is read as x, while b and c are read as y vector (N = 2), and add the function output as a new column for each row. However, I am getting the dimension error "The environment mismatches the dimension to the fuzzy measure.".
Here is my attempt.
df2 <- df %>% as_tibble() %>%
  rowwise() %>%
  mutate(ci = fm.Choquet(df$a, c(df[, 2], df[, 3]), env)) %>%
  mutate(sum = rowSums(across(where(is.numeric)))) %>% # also tried adding a sum column, which works
  as.matrix()
I am using dplyr::rowwise(), but I am open to looping or other suggestions. Can someone help me?
EDIT 1:
A relevant question is identified as a possible solution for the above question, but using one of the suggestions, by(), still throws the same error:
by(df, seq_len(nrow(df)), function(row) fm.Choquet(df$a,c(df$b,df$c), env))
A working approach is to loop over the rows with purrr::map_dbl, passing each row's own values to fm.Choquet:
set.seed(123456)
a <- qnorm(runif(30, min = pnorm(0), max = pnorm(1)))
b <- qnorm(runif(30, min = pnorm(0), max = pnorm(1)))
c <- qnorm(runif(30, min = pnorm(0), max = pnorm(1)))
df <- data.frame(a = a, b = b, c = c)
df$id <- seq_len(nrow(df))
library(Rfmtool)
library(tidyverse)
env <- fm.Init(1)
map_dbl(
  seq_len(nrow(df)),
  ~ {
    row <- slice(df, .x)
    fm.Choquet(
      x = row$a,
      v = c(row$b, row$c),
      env
    )
  }
)
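If you prefer to stay with rowwise(), the error in the original attempt most likely comes from passing whole columns (df$a, df[, 2]) to fm.Choquet inside mutate() rather than the current row's values; bare column names fix that. A sketch, using the same env as above:
df2 <- df %>%
  rowwise() %>%
  mutate(ci = fm.Choquet(x = a, v = c(b, c), env)) %>% # bare names give the row's own values
  ungroup()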
I have a data frame, e.g.:
library(lubridate) # for today()
df_reprex <- data.frame(id = rep(paste0("S", round(runif(100, 1000000, 9999999), 0)), each = 10),
                        date = rep(seq.Date(today(), by = -7, length.out = 10), 100),
                        var1 = runif(1000, 10, 20),
                        var2 = runif(1000, 20, 50),
                        var3 = runif(1000, 2, 5),
                        var250 = runif(1000, 100, 200),
                        var1_baseline = rep(runif(100, 5, 10), each = 10),
                        var2_baseline = rep(runif(100, 50, 80), each = 10),
                        var3_baseline = rep(runif(100, 1, 3), each = 10),
                        var250_baseline = rep(runif(100, 20, 70), each = 10))
I want to write a function containing a for loop that for each row in the dataframe will subtract every "_baseline" column from the non-baseline column with the same name.
I have created a script that automatically creates a character string containing the code I would like to run:
df <- df_reprex
# get only numeric columns
df_num <- df %>% dplyr::select_if(., is.numeric)
# create a version with no baselines
df_nobaselines <- df_num %>% select(-contains("baseline"))
#extract names of non-baseline columns
numeric_cols <- names(df_nobaselines)
#initialise empty string
mutatestring <- ""
#write loop to fill in string:
for (colname in numeric_cols) {
  mutatestring <- paste(mutatestring, ",", paste0(colname, "_change"), "=", colname, "-", paste0(colname, "_baseline"))
  # df_num <- df_num %>%
  #   mutate(paste0(col, "_change") = col - paste0(col, "_baseline"))
}
mutatestring <- substr(mutatestring, 4, 9999999) # remove stuff at start (I know it's inefficient)
mutatestring2 <- paste("df %>% mutate(", mutatestring, ")") # add mutate call
but when I try to call mutatestring2, it just prints the character string, e.g.:
[1] "df %>% mutate( var1_change = var1 - var1_baseline , var2_change = var2 - var2_baseline , var3_change = var3 - var3_baseline , var250_change = var250 - var250_baseline )"
I thought that this part would be relatively easy and I'm sure I've missed something obvious, but I just can't get the text inside that string to run!
I've tried various slightly ridiculous methods but none of them return the desired output (i.e. the result returned by the character string if it was entered into the console as a command):
call(mutatestring2)
eval(mutatestring2)
parse(mutatestring2)
str2lang(mutatestring2)
mget(mutatestring2)
diff_func <- function() {mutatestring2}
diff_func1 <- function() {
  a <- mutatestring2
  return(a)
}
diff_func2 <- function() {str2lang(mutatestring2)}
diff_func3 <- function() {eval(mutatestring2)}
diff_func4 <- function() {parse(mutatestring2)}
diff_func5 <- function() {call(mutatestring2)}
diff_func()
diff_func1()
diff_func2()
diff_func3()
diff_func4()
diff_func5()
I'm sure there must be a very straightforward way of doing this, but I just can't work it out!
How do I convert a character string to something that I can run or pass to a magrittr pipe?
You need to use the text parameter in parse, then eval the result. For example, you can do:
eval(parse(text = "print(5)"))
#> [1] 5
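Applied to the string built in the question, that would be:
df_new <- eval(parse(text = mutatestring2))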
However, using eval(parse()) is normally a very bad idea, and there is usually a more sensible alternative.
In your case you can do this without resorting to eval(parse()), for example in base R you could subtract all the appropriate variables from each other like this:
baseline <- grep("_baseline$", names(df_reprex), value = TRUE)
non_baseline <- gsub("_baseline", "", baseline)

df_new <- cbind(df_reprex, as.data.frame(setNames(mapply(
  function(i, j) df_reprex[[j]] - df_reprex[[i]], # non-baseline minus baseline
  baseline, non_baseline, SIMPLIFY = FALSE),
  paste0(non_baseline, "_corrected"))))
Or if you want to keep the whole thing in a single pipe without storing intermediate variables, you could do:
mapply(function(i, j) df_reprex[[j]] - df_reprex[[i]], # non-baseline minus baseline
       grep("_baseline$", names(df_reprex), value = TRUE),
       gsub("_baseline", "", grep("_baseline$", names(df_reprex), value = TRUE)),
       SIMPLIFY = FALSE) %>%
  setNames(gsub("_baseline", "_corrected",
                grep("_baseline$", names(df_reprex), value = TRUE))) %>%
  as.data.frame() %>%
  {cbind(df_reprex, .)}
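For completeness, a tidyverse-flavoured sketch of the same idea using across() (assumes dplyr >= 1.0.0; cur_column() looks up each baseline column's non-baseline partner, and the .names glue spec names the results with the "_change" suffix from the question):
library(dplyr)
df_new <- df_reprex %>%
  mutate(across(ends_with("_baseline"),
                ~ get(sub("_baseline$", "", cur_column())) - .x, # non-baseline minus baseline
                .names = "{sub('_baseline', '_change', .col)}"))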
I wrote some code to perform oversampling: I replicate the observations in a data.frame and add noise to the replicates, so they are no longer exactly the same. I'm quite happy that it now works as intended, but... it is too slow. I'm just learning dplyr and have no clue about data.table, but I hope there is a way to improve my function. I'm running this code in a function over 100s of data.frames which may contain about 10,000 columns and 400 rows.
This is some toy data:
library(tidyverse)
train_set1 <- rep(0, 300)
train_set2 <- rep("Factor1", 300)
train_set3 <- data.frame(replicate(1000, sample(0:1, 300, rep = TRUE)))
train_set <- cbind(train_set1, train_set2, train_set3)
row.names(train_set) <- c(paste("Sample", c(1:nrow(train_set)), sep = "_"))
This is the code to replicate each row a given number of times and a function to determine whether the added noise later will be positive or negative:
# replicate each row twice, added row.names contain a "."
train_oversampled <- train_set[rep(seq_len(nrow(train_set)), each = 3), ]
# create a flip function
flip <- function() {
sample(c(-1,1), 1)
}
In the relevant "too slow" piece of code, I subset the row.names for the added "." to filter for the replicates. Then I select only the numeric columns, go through them row by row, and leave the values untouched if they are 0; otherwise a certain amount of noise is added (here ±1%). Later on, I combine this data set with the original data set to get my oversampled data.frame.
# add percentage of noise to non-zero values in numerical columns
noised_copies <- train_oversampled %>%
  rownames_to_column(var = "rowname") %>%
  filter(grepl("\\.", row.names(train_oversampled))) %>%
  rowwise() %>%
  mutate_if(~ is.numeric(.), ~ if_else(. == 0, 0, . + (. * flip() * 0.01))) %>%
  ungroup() %>%
  column_to_rownames(var = "rowname")
# combine original and oversampled, noised data set
train_noised <- rbind(noised_copies, train_set)
I assume there are faster ways using e.g. data.table, but it was already tough work to get this code running and I have no idea how to improve its performance.
EDIT:
The solution is working perfectly fine with fixed values, but called within a for loop I receive "Error in paste(Sample, n, sep = ".") : object 'Sample' not found"
Code to replicate:
library(data.table)
train_set <- data.frame(
x = c(rep(0, 10)),
y = c(0:9),
z = c(rep("Factor1", 10)))
# changing the row name to avoid confusion with "Sample"
row.names(train_set) <- c(paste("Observation", c(1:nrow(train_set)), sep = "_"))
train_list <- list(aa = train_set, bb = train_set, cc = train_set)
for(current_table in train_list) {
  setDT(current_table, keep.rownames="Sample")
  cols <- names(current_table)[sapply(current_table, is.numeric)]
  noised_copies <- lapply(c(1,2), function(n) {
    copy(current_table)[,
      c("Sample", cols) := c(.(paste(Sample, n, sep=".")),
        .SD * sample(c(-1.01, 1.01), .N*ncol(.SD), TRUE)),
      .SDcols=cols]
  })
  train_noised <- rbindlist(c(noised_copies, list(train_set)), use.names=FALSE)
  # As this is an example, I did not write anything to actually
  # store the results, so I have to remove the object
  rm(train_noised)
}
Any ideas why the column Sample can't be found now?
Here is a more vectorized approach using data.table:
library(data.table)
setDT(train_set, keep.rownames="Sample")
cols <- names(train_set)[sapply(train_set, is.numeric)]
noised_copies <- lapply(c(1,2), function(n) {
  copy(train_set)[,
    c("Sample", cols) := c(.(paste(Sample, n, sep=".")),
      .SD * sample(c(-1.01, 1.01), .N*ncol(.SD), TRUE)),
    .SDcols=cols]
})
train_noised <- rbindlist(c(noised_copies, list(train_set)), use.names=FALSE)
With data.table version >= 1.12.9, you can pass is.numeric directly to the .SDcols argument, and there may be a shorter way (e.g. (.SD) or names(.SD)) to specify the left-hand side of :=.
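For illustration, a sketch of the same call under that assumption (data.table >= 1.12.9); only .SDcols changes, the left-hand side still uses the precomputed cols:
noised_copies <- lapply(c(1,2), function(n) {
  copy(train_set)[,
    c("Sample", cols) := c(.(paste(Sample, n, sep=".")),
      .SD * sample(c(-1.01, 1.01), .N*ncol(.SD), TRUE)),
    .SDcols = is.numeric] # predicate selects the numeric columns directly
})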
To address the OP's updated post: the issue is that although each data.frame within the list is converted to a data.table, train_list itself is not updated. You can update the list with lapply before the for loop:
library(data.table)
train_set <- data.frame(
x = c(rep(0, 10)),
y = c(0:9),
z = c(rep("Factor1", 10)))
# changing the row name to avoid confusion with "Sample"
row.names(train_set) <- c(paste("Observation", c(1:nrow(train_set)), sep = "_"))
train_list <- list(aa = train_set, bb = copy(train_set), cc = copy(train_set))
train_list <- lapply(train_list, setDT, keep.rownames="Sample")
for(current_table in train_list) {
  cols <- names(current_table)[sapply(current_table, is.numeric)]
  noised_copies <- lapply(c(1,2), function(n) {
    copy(current_table)[,
      c("Sample", cols) := c(.(paste(Sample, n, sep=".")),
        .SD * sample(c(-1.01, 1.01), .N*ncol(.SD), TRUE)),
      .SDcols=cols]
  })
  train_noised <- rbindlist(c(noised_copies, list(current_table)), use.names=FALSE)
  # As this is an example, I did not write anything to actually
  # store the results, so I have to remove the object
  rm(train_noised)
}
This is an extension of Update pairs of columns based on pattern in their names. Thus, this is partially motivated by curiosity and partially for entertainment.
While developing an answer to that question, it occurred to me that this may be one of those cases where a for loop is more efficient than an *apply function (and I've been looking for a good illustration of the fact that *apply is not necessarily "more efficient" than a well-constructed for loop). So I'd like to pose the question again, and ask if anyone is able to write a solution using an *apply function (or purrr if that's your thing) that performs better than the for loop I've written below. Performance will be judged on execution time as evaluated via microbenchmark on my laptop (a cheap Windows box running R 3.3.2).
data.table and dplyr suggestions are welcome as well. (I'm already making plans for what I'll do with all the microseconds I save).
The Challenge
Consider the data frame:
col_1 <- c(1,2,NA,4,5)
temp_col_1 <-c(12,2,2,3,4)
col_2 <- c(1,23,423,NA,23)
temp_col_2 <-c(1,2,23,4,5)
df_test <- data.frame(col_1, temp_col_1, col_2, temp_col_2)
set.seed(pi)
df_test <- df_test[sample(1:nrow(df_test), 1000, replace = TRUE), ]
For each col_x, replace the missing values with the corresponding value in temp_col_x. So, for example:
col_1 temp_col_1 col_2 temp_col_2
1 1 12 1 1
2 2 2 23 2
3 NA 2 423 23
4 4 3 NA 4
5 5 4 23 5
becomes
col_1 temp_col_1 col_2 temp_col_2
1 1 12 1 1
2 2 2 23 2
3 2 2 423 23
4 4 3 4 4
5 5 4 23 5
Existing Solutions
The for loop I've already written
temp_cols <- names(df_test)[grepl("^temp", names(df_test))]
cols <- sub("^temp_", "", temp_cols)
for (i in seq_along(temp_cols)){
  row_to_replace <- which(is.na(df_test[[cols[i]]]))
  df_test[[cols[i]]][row_to_replace] <- df_test[[temp_cols[i]]][row_to_replace]
}
My best apply function so far is:
lapply(names(df_test)[grepl("^temp_", names(df_test))],
       function(tc){
         col <- sub("^temp_", "", tc)
         row_to_replace <- which(is.na(df_test[[col]]))
         df_test[[col]][row_to_replace] <<- df_test[[tc]][row_to_replace]
       })
Benchmarking
As (if) suggestions come in, I will begin showing benchmarks in edits to this question. (edit: code is now a copy of Frank's answer, but run 100 times on my machine, as promised)
library(magrittr)
library(data.table)
library(microbenchmark)
set.seed(pi)
nc = 1e3
nr = 1e2
df_m0 = sample(c(1:10, NA_integer_), nc*nr, replace = TRUE) %>% matrix(nr, nc) %>% data.frame
df_r = sample(c(1:10), nc*nr, replace = TRUE) %>% matrix(nr, nc) %>% data.frame
microbenchmark(times = 100,
  for_vec = {
    df_m <- df_m0
    for (col in 1:nc){
      w <- which(is.na(df_m[[col]]))
      df_m[[col]][w] <- df_r[[col]][w]
    }
  }, lapply_vec = {
    df_m <- df_m0
    lapply(seq_along(df_m),
           function(i){
             w <- which(is.na(df_m[[i]]))
             df_m[[i]][w] <<- df_r[[i]][w]
           })
  }, for_df = {
    df_m <- df_m0
    for (col in 1:nc){
      w <- which(is.na(df_m[[col]]))
      df_m[w, col] <- df_r[w, col]
    }
  }, lapply_df = {
    df_m <- df_m0
    lapply(seq_along(df_m),
           function(i){
             w <- which(is.na(df_m[[i]]))
             df_m[w, i] <<- df_r[w, i]
           })
  }, mat = { # in lmo's answer
    df_m <- df_m0
    bah = is.na(df_m)
    df_m[bah] = df_r[bah]
  }, set = {
    df_m <- copy(df_m0)
    for (col in 1:nc){
      w = which(is.na(df_m[[col]]))
      set(df_m, i = w, j = col, v = df_r[w, col])
    }
  }
)
Results:
Unit: milliseconds
       expr       min        lq      mean    median        uq      max neval cld
    for_vec 135.83875 157.84548 175.23005 166.60090 176.81839 502.0616   100   b
 lapply_vec 135.67322 158.99496 179.53474 165.11883 178.06968 551.7709   100   b
     for_df 173.95971 204.16368 222.30677 212.76608 224.78188 446.6050   100   c
  lapply_df 181.46248 205.57069 220.38911 215.08505 223.98406 381.1006   100   c
        mat 129.27835 154.01248 173.11378 159.83070 169.67439 453.0888   100   b
        set  66.86402  81.08138  86.32626  85.51029  89.58331 123.1926   100   a
data.table provides the set function to modify data.tables or data.frames by reference.
Here's a benchmark that is more flexible with respect to numbers of cols and rows and that sidesteps the awkward column-name stuff in the OP:
library(magrittr)
nc = 1e3
nr = 1e2
df_m0 = sample(c(1:10, NA_integer_), nc*nr, replace = TRUE) %>% matrix(nr, nc) %>% data.frame
df_r = sample(c(1:10), nc*nr, replace = TRUE) %>% matrix(nr, nc) %>% data.frame
library(data.table)
library(microbenchmark)
microbenchmark(times = 10,
  for_vec = {
    df_m <- df_m0
    for (col in 1:nc){
      w <- which(is.na(df_m[[col]]))
      df_m[[col]][w] <- df_r[[col]][w]
    }
  }, lapply_vec = {
    df_m <- df_m0
    lapply(seq_along(df_m), function(i){
      w <- which(is.na(df_m[[i]]))
      df_m[[i]][w] <<- df_r[[i]][w]
    })
  }, for_df = {
    df_m <- df_m0
    for (col in 1:nc){
      w <- which(is.na(df_m[[col]]))
      df_m[w, col] <- df_r[w, col]
    }
  }, lapply_df = {
    df_m <- df_m0
    lapply(seq_along(df_m), function(i){
      w <- which(is.na(df_m[[i]]))
      df_m[w, i] <<- df_r[w, i]
    })
  }, mat = { # in lmo's answer
    df_m <- df_m0
    bah = is.na(df_m)
    df_m[bah] = df_r[bah]
  }, set = {
    df_m <- copy(df_m0)
    for (col in 1:nc){
      w = which(is.na(df_m[[col]]))
      set(df_m, i = w, j = col, v = df_r[w, col])
    }
  }
)
Which gives...
Unit: milliseconds
       expr       min        lq      mean    median        uq      max neval
    for_vec  77.06501  89.53430 100.10051  96.33764 106.13486 142.1329    10
 lapply_vec  77.67366  89.04438  98.81510  99.08863 108.86491 117.2956    10
     for_df 103.79097 130.33134 140.95398 144.46526 157.11335 161.4507    10
  lapply_df  97.04616 114.17825 126.10633 131.20382 137.64375 149.7765    10
        mat  73.47691  84.51473 100.16745 103.44476 112.58006 128.6166    10
        set  44.32578  49.58586  62.52712  56.30460  71.63432 101.3517    10
Comments:
If we adjust nc and nr or the frequency of NAs, the ranking of these options might change. I guess the more cols there are, the better the mat way (from @lmo's answer) and the set way look.
The copy in the set test takes some extra time beyond what we'd see in practice, since the set function just modifies the table by reference (unlike the other options, I think).
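A tiny sketch of those reference semantics:
library(data.table)
dt <- data.table(a = c(1, NA, 3), b = 1:3)
set(dt, i = which(is.na(dt$a)), j = "a", value = 0) # modifies dt in place, no copy
dt$a
# [1] 1 0 3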
Here is a readable solution. Probably slower than some.
df_test[c(TRUE, FALSE)][is.na(df_test[c(TRUE, FALSE)])] <-
df_test[c(FALSE, TRUE)][is.na(df_test[c(TRUE, FALSE)])]
This could be sped up a bit with pre-allocating the replacement so it is only performed once.
filler <- is.na(df_test[c(TRUE, FALSE)])
df_test[c(TRUE, FALSE)][filler] <- df_test[c(FALSE, TRUE)][filler]
In a two data.frame scenario, df1 and df2, this logic would be
filler <- is.na(df1)
df1[filler] <- df2[filler]
Maybe this is naive, but how about neither? I think it's still in the spirit of things if you're just looking for the fastest method. I suspect this won't be it though.
col_1 <- c(1,2,NA,4,5)
temp_col_1 <-c(12,2,2,3,4)
col_2 <- c(1,23,423,NA,23)
temp_col_2 <-c(1,2,23,4,5)
df_test <- data.frame(col_1, temp_col_1, col_2, temp_col_2)
set.seed(pi)
df_test <- df_test[sample(1:nrow(df_test), 1000, replace = TRUE), ]
df_test$col_1 <- ifelse(is.na(df_test$col_1), df_test$temp_col_1, df_test$col_1)
df_test$col_2 <- ifelse(is.na(df_test$col_2), df_test$temp_col_2, df_test$col_2)
I have a data frame with two string variables with an equal number of characters. These strings represent a student's responses on some exam. The first string contains a + sign for each question answered correctly and the incorrect response for each incorrectly answered item. The second string contains all the correct answers. I want to replace all the + signs in the first string with the correct answer from the second string. A simplified heuristic data set can be created with this code:
df <- data.frame(v1 = c("+AA+B", "D++CC", "A+BAD"),
v2 = c("DBBAD", "BDCAD","CDCCA"), stringsAsFactors = FALSE)
So the + signs in df$v1 need to be replaced w/ the letters in df$v2 that are the same distance from the start of the string. Any ideas?
When df$v1 and df$v2 are characters, we may use
regmatches(df$v1, gregexpr("\\+", df$v1)) <- regmatches(df$v2, gregexpr("\\+", df$v1))
That is,
df <- data.frame(v1 = c("+AA+B", "D++CC", "A+BAD"),
v2 = c("DBBAD", "BDCAD", "CDCCA"),
stringsAsFactors = FALSE)
rg <- gregexpr("\\+", df$v1)
regmatches(df$v1, rg) <- regmatches(df$v2, rg)
df
# v1 v2
# 1 DAAAB DBBAD
# 2 DDCCC BDCAD
# 3 ADBAD CDCCA
rg contains the positions of "+" in df$v1, and we conveniently exploit regmatches to replace those matches in df$v1 with whatever is in df$v2 at the same positions.
This one seems valid, too:
mapply(function(x, y) paste0(ifelse(x == "+", y, x), collapse = ""),
strsplit(as.character(df$v1), ""), strsplit(as.character(df$v2), ""))
#[1] "DAAAB" "DDCCC" "ADBAD"
Based on Tyler Rinker's answer, conceptually it's the same, but using just one lapply and ifelse.
dats <- lapply(df, function(x) do.call(rbind, strsplit(as.character(x), "")))
apply(with(dats, ifelse(v1 == "+", v2, v1)), 1, paste0, collapse = "")
# [1] "DAAAB" "DDCCC" "ADBAD"
Most likely there's a better approach, but here's one where I make the two columns into matrices and then use a lookup key:
## df<-data.frame(v1 = c("+AA+B", "D++CC", "A+BAD"), v2 = c("DBBAD", "BDCAD","CDCCA"))
dats <- lapply(df, function(x) do.call(rbind, strsplit(as.character(x), "")))
dats[[1]][dats[[1]] == "+"] <- dats[[2]][dats[[1]] == "+"]
apply(dats[[1]], 1, paste, collapse = "")
## [1] "DAAAB" "DDCCC" "ADBAD"
I thought this one may be an interesting one to benchmark:
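(The benchmarking code itself isn't shown; presumably something like the following sketch, assuming each answer was wrapped in a function named after its author.)
library(microbenchmark)
microbenchmark(Andrea(), Josh(), Tyler(), Jibler(), Alexis(), Julius(),
               times = 1000)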
Unit: microseconds
     expr     min      lq  median       uq      max neval
 Andrea() 296.693 313.953 321.884 328.4155 2443.051  1000
   Josh() 300.891 314.420 319.551 326.5500 3748.779  1000
  Tyler() 144.148 155.344 159.543 164.2080 2233.593  1000
 Jibler() 174.937 188.932 193.597 198.7290 2269.514  1000
 Alexis() 154.877 167.007 171.672 175.4040 2342.753  1000
 Julius() 394.658 413.317 420.315 429.4120 2549.412  1000
df <- data.frame(v1 = c("+AA+B", "D++CC", "A+BAD"),
                 v2 = c("DBBAD", "BDCAD", "CDCCA"),
                 stringsAsFactors = FALSE)

f <- function(x, y){
  xs <- unlist(strsplit(x, split = ""))
  ys <- unlist(strsplit(y, split = ""))
  paste(ifelse(xs == "+", ys, xs), collapse = "")
}

# f works on one pair of strings, so loop over the pairs by index
vapply(seq_along(df$v1), function(i) f(df$v1[i], df$v2[i]), FUN.VALUE = character(1))