Dropping partially overlapping parts of 2 vectors in R

I wonder if it might be possible to drop the parts in n1 character vector that partially overlap with elements in f1 formula.
For example, in n1, we see "chyes"&"bmi:chyes" partially overlap with ch in f1.
Thus in the desired_output, we want to drop the "ch" part of them. Because other elements in n1 either fully overlap with an element in f1 (ex. bmi) or don't exist in f1 (ex. intrcpt), we leave them unchanged.
I have tried the following solution, but can't get my desired output.
Is obtaining my desired_output possible in base R or the tidyverse?
f1 <- yi~ bmi*ch
n1 <- c("intrcpt","bmi","chyes","bmi:chyes")
desired_output <- c("intrcpt","bmi","yes","bmi:yes")
### Current unsuccessful solution:
foo <- function(fmla, vec) {
v1 <- all.vars(fmla)
v2 <- setdiff(vec, v1)
v1 <- paste0('^', v1)
v3 <- sub(paste(v1, collapse = "|"), "", v2)
vec[vec %in% v2] <- v3
vec
}
### EXAMPLE OF USE:
foo(f1, n1)
# "intrcpt" "bmi" "chyes" "bmi:chyes"

This function does what you want, but I agree with @Onyambu that it is worth considering whether your underlying problem actually necessitates string manipulation.
f <- function(fm, nm) {
vars <- vapply(attr(terms(fm), "variables"), deparse, "")[-1L]
subpat <- paste0(gsub("([()])", "\\\\\\1", vars), collapse = "|")
l <- rapply(strsplit(nm, ":"), sub, how = "list",
perl = TRUE,
pattern = sprintf("^(?!(%1$s)$)(%1$s)(.+)$", subpat),
replacement = "\\3")
vapply(l, paste0, "", collapse = ":")
}
fm1 <- yi ~ bmi * ch
nm1 <- c("intrcpt", "bmi", "chyes", "bmi:chyes")
f(fm1, nm1)
[1] "intrcpt" "bmi" "yes" "bmi:yes"
fm2 <- yi ~ bmi * factor(ch)
nm2 <- c("intrcpt", "bmi", "factor(ch)yes", "bmi:factor(ch)yes")
f(fm2, nm2)
[1] "intrcpt" "bmi" "yes" "bmi:yes"
fm3 <- gi ~ 0 + time:pub_type + time_wk_whn + time_wk_btw
nm3 <- c("time_wk_whn", "time_wk_btw", "timePost-test 1:pub_typejournal", "timePost-test 2:pub_typejournal")
f(fm3, nm3)
[1] "time_wk_whn" "time_wk_btw"
[3] "Post-test 1:journal" "Post-test 2:journal"
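To make the regular expression in f easier to follow, here is a minimal sketch of the pattern it builds, applied to single terms. It assumes simplified, hypothetical inputs (variable names with no parentheses, so no escaping is needed) and is only meant to illustrate the idea, not to replace the function above.

```r
# Hypothetical, simplified inputs: variable names without regex metacharacters.
vars <- c("bmi", "ch")
subpat <- paste0(vars, collapse = "|")

# Same template as in f(): group 1 sits inside the negative lookahead,
# group 2 is the variable name, group 3 is whatever follows it.
pat <- sprintf("^(?!(%1$s)$)(%1$s)(.+)$", subpat)
pat
# [1] "^(?!(bmi|ch)$)(bmi|ch)(.+)$"

# A term is rewritten only when a variable name is a proper prefix of it.
res <- sub(pat, "\\3", c("bmi", "chyes", "intrcpt"), perl = TRUE)
res
# [1] "bmi"     "yes"     "intrcpt"
```

In f, the same substitution is applied to every piece obtained by splitting each name on ":", and the pieces are then pasted back together.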

Related

Eliminating partially overlapping parts of 2 vectors in R

I wonder if it might be possible to drop the parts in n1 character vector that partially overlap with elements in f1 formula.
For example, in n1, we see "study_typecompare" & "study_typecontrol" partially overlap with study_type in f1.
Thus in the desired_output, we want to drop the "study_type" part of them. Because other elements (ex. time_wk_whn) in n1 fully overlap with an element in f1, we leave them unchanged.
Is obtaining my desired_output possible in base R or the tidyverse?
f1 <- gi ~ 0 + study_type + time_wk_whn + time_wk_btw + items_whn +
items_btw + training_hr_whn + training_hr_btw
n1 <- c("study_typecompare","study_typecontrol","time_wk_whn",
"time_wk_btw","items_whn","items_btw","training_hr_whn",
"training_hr_btw")
desired_output <- c("compare","control", "time_wk_whn",
"time_wk_btw","items_whn","items_btw",
"training_hr_whn","training_hr_btw")
We create a function that takes the formula and the vector ('fmla' and 'vec'). It extracts the variables from 'fmla' (all.vars), finds the values in the vector that are not among the formula variables (setdiff), builds a pattern by pasting those variables together with "|", replaces the first match with a blank ("") using sub, and returns the updated 'vec'.
fun1 <- function(fmla, vec) {
v1 <- all.vars(fmla)
v2 <- setdiff(vec, v1)
v3 <- sub(paste(v1, collapse = "|"), "", v2)
vec[vec %in% v2] <- v3
vec
}
Checking:
> identical(fun1(f1, n1), desired_output)
[1] TRUE
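As a rough illustration of the steps described above, the intermediate values can be inspected one at a time. The shortened formula and vector here are assumptions made for brevity, not the original data.

```r
# Shortened, hypothetical version of the inputs from the question.
f1 <- gi ~ 0 + study_type + time_wk_whn
n1 <- c("study_typecompare", "study_typecontrol", "time_wk_whn")

v1 <- all.vars(f1)               # variables used in the formula
v1
# [1] "gi"          "study_type"  "time_wk_whn"

v2 <- setdiff(n1, v1)            # names that are not exact variable names
v2
# [1] "study_typecompare" "study_typecontrol"

pat <- paste(v1, collapse = "|") # alternation of all variable names
v3 <- sub(pat, "", v2)           # drop the first match in each name
v3
# [1] "compare" "control"
```

Note that this pattern is not anchored, so sub removes the first occurrence of any variable name wherever it appears in the string. If only leading matches should be removed, prefixing each variable with "^" before pasting is one option.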

consider i inside a for loop as a vector and not as a character in R

I have several vectors named: aa, bb, ab, ac, etc.
I'd like to use their names ("01","02","03","04", etc) to match the vectors with a df.
But I get an error because names(i) doesn't work, given that (in case i <- "aa") R reads names("aa"), instead of: names(aa).
How can this be solved?
noquote(i) worked, but not inside names: names(noquote(i)). This produced an error.
x <- c("aa","bb","ab","ac")
names(x) <- c("01","02","03","04")
for (i in x) {
df[match(names(i), df$ID), i] <- noquote(i)
}
########################################
Example:
df <- data.frame(ID = c("01","02","02","03","03","03"),
var_1 = c(0,0,1,0,0,1),
var_2 = c(0,0,0,0,1,0),
var_3 = c(1,0,0,0,0,0))
aa <- tapply(df$var_1, df$ID, max, na.rm = T)
bb <- tapply(df$var_2, df$ID, max, na.rm = T)
cc <- tapply(df$var_3, df$ID, max, na.rm = T)
In df_2 there's just one row for each ID, and I want to include in df_2 the vectors I created with tapply.
df_2 <- data.frame(ID=c("01","02","03"), age=c(34,12,49))
In order to match exactly the ID I used the following code:
df_2[match(names(aa), df_2$ID), "aa"] <- aa
df_2[match(names(bb), df_2$ID), "bb"] <- bb
df_2[match(names(cc), df_2$ID), "cc"] <- cc
To avoid this repetition, I was trying to use the following code (the one I wrote at the beginning of the post):
x <- c("aa","bb","cc")
for (i in x) {
df_2[match(names(i), df_2$ID), i] <- noquote(i)
}
but I have a problem with names(i)
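One possible way around this, sketched here under the assumption that aa, bb and cc exist in the environment where the loop runs, is to look each object up by its name with get(). Inside the loop, i is still just a character string; get(i) returns the vector that string names.

```r
df <- data.frame(ID = c("01", "02", "02", "03", "03", "03"),
                 var_1 = c(0, 0, 1, 0, 0, 1),
                 var_2 = c(0, 0, 0, 0, 1, 0),
                 var_3 = c(1, 0, 0, 0, 0, 0))
aa <- tapply(df$var_1, df$ID, max, na.rm = TRUE)
bb <- tapply(df$var_2, df$ID, max, na.rm = TRUE)
cc <- tapply(df$var_3, df$ID, max, na.rm = TRUE)
df_2 <- data.frame(ID = c("01", "02", "03"), age = c(34, 12, 49))

for (i in c("aa", "bb", "cc")) {
  v <- get(i)  # fetch the object whose name is stored in i
  # names(v) are the IDs; as.vector() drops the array attributes from tapply
  df_2[match(names(v), df_2$ID), i] <- as.vector(v)
}
df_2
```

Keeping the vectors in a named list from the start and looping over names(the_list) would avoid get() altogether, which is often easier to maintain.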

Get non-zero values from string in R

I have two strings:
x1 = "03011000"
x2 = "13001000"
The strings have an exact overlap in their non-zero characters. I would like to get for every character position the max element.
So in this case the result would be:
result = "13011000"
The first character is a 1 because x2 has a 1 at the first position. The fourth character is also a 1 because x1 has a 1 at this position.
The way I go about it is the following:
paste0(mapply(pmax, strsplit(x1, ""), strsplit(x2, "")), collapse = "")
But this seems excessive as I have to split every character into its components and compare them. Is there a better approach to this?
Using raw comparison:
rawToChar(as.raw(pmax(as.numeric(charToRaw(x1)), as.numeric(charToRaw(x2)))))
# [1] "13011000"
We could wrap it into a function:
foo <- function(x, y){
mapply(FUN = function(x, y) {
rawToChar(as.raw(pmax(as.numeric(charToRaw(x)), as.numeric(charToRaw(y)))))
}, x = x, y = y, USE.NAMES = FALSE)
}
x1 <- "03011000"
x2 <- "13001000"
foo(x1, x2)
# [1] "13011000"
x1 <- c("03011000", "ab", "123")
x2 <- c("13001000", "cd", "212")
foo(x1, x2)
# [1] "13011000" "cd" "223"
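The raw comparison works because the character codes of the digits "0" to "9" are in the same order as the digits themselves, so taking the maximum code at each position gives the maximum character. A similar sketch using the base R functions utf8ToInt and intToUtf8, assuming two single strings of equal length (this version is not vectorised over several strings):

```r
x1 <- "03011000"
x2 <- "13001000"

# Convert each string to its code points, take the element-wise maximum,
# and convert the result back to a single string.
res <- intToUtf8(pmax(utf8ToInt(x1), utf8ToInt(x2)))
res
# [1] "13011000"
```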
Regarding "The strings have an exact overlap in their non-zero characters":
I assume this means that wherever both strings are nonzero, they are guaranteed to match?
If so, it is sufficient to find the positions with zeros in one vector and not in the other (with setdiff) and make the string edit:
r <- gregexpr("0", c(x1,x2))
w <- setdiff(r[[1]], r[[2]])
rr <- structure(w, match.length = rep(1L, length(w)), useBytes = TRUE)
x = x1
regmatches(x, rr) <- regmatches(x2, rr)
x
# [1] "13011000"

Converting a vector into formula

Suppose we have a data.frame and a vector containing only -1, 0 and 1, with length equal to the number of columns of the data.frame. Is there a natural way to transform the vector into a formula in which the columns whose position holds a -1 appear on the left side and those whose position holds a +1 appear on the right side?
For example, given the following data.frame
df = data.frame(
'a' = rnorm(10),
'b' = rnorm(10),
'c' = rnorm(10),
'd' = rnorm(10),
'e' = rnorm(10))
and following vector vec = c(-1,-1,0,1,1).
Is there a natural way to build formula a+b~d+e?
We assume that if there are no 1's in vec that we should use a right hand side of 1 and if there are no -1's in vec then the left hand side is empty.
The alternatives each produce a character string but if a formula class object is wanted use formula(s) where s is that string.
1) paste each side Subset out the names corresponding to vec -1 giving LHS and paste/collapse them and do the same with vec 1 giving RHS and paste those with ~ all together. If we knew that there were at least one 1 in vec we could omit the if statement. Of the solutions here this seems the most straightforward.
nms <- names(df)
LHS <- paste(nms[vec == -1], collapse = "+")
RHS <- paste(nms[vec == 1], collapse = "+")
if (RHS == "") RHS <- "1"
paste0(LHS, "~", RHS)
## [1] "a+b~d+e"
2) sapply Alternately combine the LHS and RHS lines into a single sapply. If we knew that there were at least one 1 in vec then we could simplify the code by omitting the if statement. This approach is shorter than (1).
sa <- sapply(c(-1, 1), function(x) paste(names(df)[vec == x], collapse = "+"))
if (sa[2] == "") sa[2] <- "1"
paste0(sa[1], "~", sa[2])
## [1] "a+b~d+e"
3) tapply We can alternately combine the LHS and RHS lines into a single tapply like this:
ta <- tapply(names(df), vec, paste, collapse = "+")
paste0(if (any(vec == -1)) ta[["-1"]], "~", if (any(vec == 1)) ta[["1"]] else 1)
## [1] "a+b~d+e"
If we knew that -1 and 1 each appear at least once in vec then we can simplify the last line to:
paste0(ta[["-1"]], "~", ta[["1"]])
## [1] "a+b~d+e"
Overall this approach is the shortest if we can guarantee that there will be at least one 1 and at least one -1 but otherwise handling the edge cases seems somewhat cumbersome compared to the other approaches.
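As a small sketch of how approach (1) could be wrapped up so that it returns a formula object rather than a string, the helper below pastes both sides and converts the result with as.formula. The function name vec_to_formula is hypothetical and chosen only for illustration.

```r
# Hypothetical helper: nms are the column names, vec holds -1, 0 or 1.
vec_to_formula <- function(nms, vec) {
  LHS <- paste(nms[vec == -1], collapse = "+")
  RHS <- paste(nms[vec == 1], collapse = "+")
  if (RHS == "") RHS <- "1"          # no 1's in vec: intercept-only right side
  as.formula(paste0(LHS, "~", RHS))  # an empty LHS gives a one-sided formula
}

fml <- vec_to_formula(c("a", "b", "c", "d", "e"), c(-1, -1, 0, 1, 1))
class(fml)
# [1] "formula"
all.vars(fml)
# [1] "a" "b" "d" "e"
```

With a data frame, the first argument would simply be names(df).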
We could also do this by grouping the names by vec and pasting within each group:
paste(aggregate(nm ~ vec, subset(data.frame(nm = names(df), vec,
stringsAsFactors = FALSE), vec != 0),
FUN = paste, collapse= ' + ')[['nm']], collapse=' ~ ')
#[1] "a + b ~ d + e"
Or another option is tapply
paste(tapply(names(df), vec, FUN = paste,
collapse= ' + ')[c('-1', '1')], collapse= ' ~ ')
#[1] "a + b ~ d + e"

Replace a given character in a string variable with a character from another string variable of equal length

I have a data frame with two string variables with an equal number of characters. These strings represent a student's responses to an exam. The first string contains a + sign for each question answered correctly and the incorrect response for each question answered incorrectly. The second string contains all the correct answers. I want to replace all the + signs in the first string with the correct answer from the second string. A simplified heuristic data set can be created with this code:
df <- data.frame(v1 = c("+AA+B", "D++CC", "A+BAD"),
v2 = c("DBBAD", "BDCAD","CDCCA"), stringsAsFactors = FALSE)
So the + signs in df$v1 need to be replaced with the letters in df$v2 that are the same distance from the start of the string. Any ideas?
When df$v1 and df$v2 are characters we may use
regmatches(df$v1, gregexpr("\\+", df$v1)) <- regmatches(df$v2, gregexpr("\\+", df$v1))
That is,
df <- data.frame(v1 = c("+AA+B", "D++CC", "A+BAD"),
v2 = c("DBBAD", "BDCAD", "CDCCA"),
stringsAsFactors = FALSE)
rg <- gregexpr("\\+", df$v1)
regmatches(df$v1, rg) <- regmatches(df$v2, rg)
df
# v1 v2
# 1 DAAAB DBBAD
# 2 DDCCC BDCAD
# 3 ADBAD CDCCA
rg contains the positions of "+" in df$v1, and we conveniently exploit regmatches to replace those matches in df$v1 with whatever is in df$v2 at the same positions.
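To see what rg holds and what gets substituted, here is a minimal sketch on a single pair of strings taken from the first row of the example data:

```r
v1 <- "+AA+B"
v2 <- "DBBAD"

rg <- gregexpr("\\+", v1)  # match data for every "+" in v1
pos <- as.integer(rg[[1]]) # start positions of the matches
pos
# [1] 1 4

# Characters of v2 at those same positions.
regmatches(v2, rg)[[1]]
# [1] "D" "A"

# Overwrite the "+" signs in v1 with those characters.
regmatches(v1, rg) <- regmatches(v2, rg)
v1
# [1] "DAAAB"
```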
This one seems valid, too:
mapply(function(x, y) paste0(ifelse(x == "+", y, x), collapse = ""),
strsplit(as.character(df$v1), ""), strsplit(as.character(df$v2), ""))
#[1] "DAAAB" "DDCCC" "ADBAD"
Based on Tyler Rinker's answer, conceptually it's the same, but using just one lapply and ifelse.
> dats <- lapply(df, function(x) do.call(rbind, strsplit(as.character(x), "")))
> apply(with(dats, ifelse(v1=="+", v2, v1)), 1, paste0, collapse="")
[1] "DAAAB" "DDCCC" "ADBAD"
Most likely there's a better approach, but here's one where I turn the two columns into matrices and then use one as a lookup key:
## df<-data.frame(v1 = c("+AA+B", "D++CC", "A+BAD"), v2 = c("DBBAD", "BDCAD","CDCCA"))
dats <- lapply(df, function(x) do.call(rbind, strsplit(as.character(x), "")))
dats[[1]][dats[[1]] == "+"] <- dats[[2]][dats[[1]] == "+"]
apply(dats[[1]], 1, paste, collapse = "")
## [1] "DAAAB" "DDCCC" "ADBAD"
I thought this one may be an interesting one to benchmark:
Unit: microseconds
     expr     min      lq  median       uq      max neval
 Andrea() 296.693 313.953 321.884 328.4155 2443.051  1000
   Josh() 300.891 314.420 319.551 326.5500 3748.779  1000
  Tyler() 144.148 155.344 159.543 164.2080 2233.593  1000
 Jibler() 174.937 188.932 193.597 198.7290 2269.514  1000
 Alexis() 154.877 167.007 171.672 175.4040 2342.753  1000
 Julius() 394.658 413.317 420.315 429.4120 2549.412  1000
df <- data.frame(v1 = c("+AA+B", "D++CC", "A+BAD"),
                 v2 = c("DBBAD", "BDCAD", "CDCCA"),
                 stringsAsFactors = FALSE)
# Replace each "+" in x with the character at the same position in y.
f <- function(x, y) {
  xs <- unlist(strsplit(x, split = ""))
  ys <- unlist(strsplit(y, split = ""))
  paste(ifelse(xs == "+", ys, xs), collapse = "")
}
# mapply pairs the i-th element of v1 with the i-th element of v2.
mapply(f, df$v1, df$v2, USE.NAMES = FALSE)
