Given a data.frame and a vector containing only -1, 0 and 1 whose length equals the number of columns of the data.frame, is there a natural way to transform the vector into a formula in which the columns at positions holding -1 appear on the left side and those at positions holding 1 appear on the right side?
For example, given the following data.frame
df = data.frame(
'a' = rnorm(10),
'b' = rnorm(10),
'c' = rnorm(10),
'd' = rnorm(10),
'e' = rnorm(10))
and following vector vec = c(-1,-1,0,1,1).
Is there a natural way to build formula a+b~d+e?
We assume that if there are no 1's in vec then the right hand side should be 1, and that if there are no -1's in vec then the left hand side is empty.
The alternatives each produce a character string but if a formula class object is wanted use formula(s) where s is that string.
1) paste each side Subset the names where vec is -1 to get the LHS and paste/collapse them, do the same where vec is 1 to get the RHS, and then paste the two sides together with ~ between them. If we knew that there was at least one 1 in vec we could omit the if statement. Of the solutions here this seems the most straightforward.
nms <- names(df)
LHS <- paste(nms[vec == -1], collapse = "+")
RHS <- paste(nms[vec == 1], collapse = "+")
if (RHS == "") RHS <- "1"
paste0(LHS, "~", RHS)
## [1] "a+b~d+e"
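As a minimal sketch, approach (1) can be wrapped in a small helper that returns a formula object rather than a string. The name vec_to_formula and its arguments are hypothetical (they are not part of the answer), and the edge-case behaviour follows the assumptions stated above: an empty right hand side becomes 1 and an empty left hand side gives a one-sided formula.

```r
# Hypothetical helper, for illustration only: builds a formula from column
# names and a vector of -1/0/1 codes, following approach (1).
vec_to_formula <- function(nms, vec) {
  stopifnot(length(nms) == length(vec))
  lhs <- paste(nms[vec == -1], collapse = "+")  # "" when there are no -1's
  rhs <- paste(nms[vec == 1], collapse = "+")   # "" when there are no 1's
  if (rhs == "") rhs <- "1"                     # intercept-only right hand side
  as.formula(paste0(lhs, "~", rhs))
}

nms <- c("a", "b", "c", "d", "e")
f_full  <- vec_to_formula(nms, c(-1, -1, 0, 1, 1))  # a + b ~ d + e
f_nolhs <- vec_to_formula(nms, c(0, 0, 0, 1, 1))    # one-sided formula
f_norhs <- vec_to_formula(nms, c(-1, -1, 0, 0, 0))  # a + b ~ 1
```

Passing names(df) as nms would give the formula for the data.frame in the question.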
2) sapply Alternately, combine the LHS and RHS lines into a single sapply. If we knew that there was at least one 1 in vec then we could simplify the code by omitting the if statement. This approach is shorter than (1).
sa <- sapply(c(-1, 1), function(x) paste(names(df)[vec == x], collapse = "+"))
if (sa[2] == "") sa[2] <- "1"
paste0(sa[1], "~", sa[2])
## [1] "a+b~d+e"
3) tapply We can alternately combine the LHS and RHS lines into a single tapply like this:
ta <- tapply(names(df), vec, paste, collapse = "+")
paste0(if (any(vec == -1)) ta[["-1"]], "~", if (any(vec == 1)) ta[["1"]] else 1)
## [1] "a+b~d+e"
If we knew that -1 and 1 each appear at least once in vec then we can simplify the last line to:
paste0(ta[["-1"]], "~", ta[["1"]])
## [1] "a+b~d+e"
Overall this approach is the shortest if we can guarantee that there will be at least one 1 and at least one -1 but otherwise handling the edge cases seems somewhat cumbersome compared to the other approaches.
We could do this by grouping the column names by vec and pasting the names within each group
paste(aggregate(nm ~ vec, subset(data.frame(nm = names(df), vec,
stringsAsFactors = FALSE), vec != 0),
FUN = paste, collapse= ' + ')[['nm']], collapse=' ~ ')
#[1] "a + b ~ d + e"
Or another option is tapply
paste(tapply(names(df), vec, FUN = paste,
collapse= ' + ')[c('-1', '1')], collapse= ' ~ ')
#[1] "a + b ~ d + e"
I wonder if it might be possible to drop the parts in n1 character vector that partially overlap with elements in f1 formula.
For example, in n1, we see "study_typecompare" & "study_typecontrol" partially overlap with study_type in f1.
Thus in the desired_output, we want to drop the "study_type" part of them. Because other elements (ex. time_wk_whn) in n1 fully overlap with an element in f1, we leave them unchanged.
Is obtaining my desired_output possible in base R or tidyverse?
f1 <- gi ~ 0 + study_type + time_wk_whn + time_wk_btw + items_whn +
items_btw + training_hr_whn + training_hr_btw
n1 <- c("study_typecompare","study_typecontrol","time_wk_whn",
"time_wk_btw","items_whn","items_btw","training_hr_whn",
"training_hr_btw")
desired_output <- c("compare","control", "time_wk_whn",
"time_wk_btw","items_whn","items_btw",
"training_hr_whn","training_hr_btw")
We create a function that takes the formula and the vector ('fmla', 'vec'). It extracts the variables from 'fmla' (all.vars), finds the values in the vector that are not among the formula variables (setdiff), builds a pattern by pasting the variable names together with "|", removes the first match of that pattern from each of those values with sub, and returns the updated vector.
fun1 <- function(fmla, vec) {
v1 <- all.vars(fmla)
v2 <- setdiff(vec, v1)
v3 <- sub(paste(v1, collapse = "|"), "", v2)
vec[vec %in% v2] <- v3
vec
}
Checking:
> identical(fun1(f1, n1), desired_output)
[1] TRUE
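One caveat with an unanchored pattern is that sub removes the first match anywhere in the string, so a short variable name can be stripped from the middle or end of an unrelated element. A hedged variant, assuming the variable name always appears as a prefix (as it does for factor-level coefficient names such as "study_typecompare") and contains no regex metacharacters, anchors the pattern at the start. The name strip_prefix is hypothetical.

```r
# Hypothetical variant, for illustration only: only strip a formula variable
# name when it is a prefix of the element.
strip_prefix <- function(fmla, vec) {
  v1 <- all.vars(fmla)
  # try longer names first so that a short name cannot pre-empt a longer one
  v1 <- v1[order(nchar(v1), decreasing = TRUE)]
  pat <- paste0("^(", paste(v1, collapse = "|"), ")")
  partial <- !(vec %in% v1)          # elements that are not exact variable names
  vec[partial] <- sub(pat, "", vec[partial])
  vec
}

f1 <- gi ~ 0 + study_type + time_wk_whn + time_wk_btw + items_whn +
  items_btw + training_hr_whn + training_hr_btw
n1 <- c("study_typecompare", "study_typecontrol", "time_wk_whn",
        "time_wk_btw", "items_whn", "items_btw", "training_hr_whn",
        "training_hr_btw")
res <- strip_prefix(f1, n1)

# "max" contains "x" but does not start with it, so it is left alone
res2 <- strip_prefix(y ~ x, c("x2", "max"))
# [1] "2"   "max"
```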
I am comparing two lists of formulas to see if some previously computed models can be reused. Right now I'm doing this like this:
set.seed(123)
# create some random formulas
l1 <- l2 <- list()
for (i in 1:10) {
l1[[i]] <- as.formula(paste("z ~", paste(sample(letters, 3), collapse = " + ")))
l2[[i]] <- as.formula(paste("z ~", paste(sample(letters, 3), collapse = " + ")))
}
# at least one appears in the other list
l1[[5]] <- l2[[7]]
# helper function to convert formulas to character strings
as.formulaCharacter <- function(x) paste(deparse(x))
# convert both lists to strings
s1 <- sapply(l1, as.formulaCharacter)
s2 <- sapply(l2, as.formulaCharacter)
# look up elements of one vector in the other
idx <- match(s1, s2, nomatch = 0L) # 7
s1[idx > 0] # elements of s1 that also occur in s2
However, I noticed that some formulas are not retrieved although they are practically equivalent.
f1 <- z ~ b + c + b:c
f2 <- z ~ c + b + c:b
match(as.formulaCharacter(f1), as.formulaCharacter(f2)) # no match
I get why this result is different, the strings just aren't the same, but I'm struggling with how to extend this approach to also work for formulas with reordered elements. I could use strsplit to first sort all formula components independently, but that sounds horribly inefficient to me.
Any ideas?
If the formulas are restricted to a sum of terms which contain colon separated variables then we can create a standardized string by extracting the term labels, exploding those with colons, sorting them, pasting the exploded terms back together, sorting this and turning that into a formula string.
stdize <- function(fo) {
s <- strsplit(attr(terms(fo), "term.labels"), ":")
terms <- sort(sapply(lapply(s, sort), paste, collapse = ":"))
format(reformulate(terms, all.vars(fo)[1]))
}
stdize(f1) == stdize(f2)
## [1] TRUE
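To connect this back to the original task of matching two lists of formulas, here is a minimal sketch that standardizes every formula first and then uses match on the standardized strings. It assumes the same restriction as above (sums of terms made of colon separated variables), defines stdize with terms() applied to the function's own argument fo, and uses two small made-up lists l1 and l2 purely for illustration.

```r
# Standardize a formula: sort variables within each term, then sort the terms.
stdize <- function(fo) {
  s <- strsplit(attr(terms(fo), "term.labels"), ":")
  labs <- sort(sapply(lapply(s, sort), paste, collapse = ":"))
  format(reformulate(labs, all.vars(fo)[1]))
}

# Illustrative lists (not from the question)
l1 <- list(z ~ b + c + b:c, z ~ a + d)
l2 <- list(z ~ x + y, z ~ c + b + c:b)

k1 <- vapply(l1, stdize, character(1))
k2 <- vapply(l2, stdize, character(1))

# position in l2 of each formula in l1, 0 where there is no equivalent
idx <- match(k1, k2, nomatch = 0L)
idx
# [1] 2 0
```

Standardizing each formula once and matching on the resulting keys keeps the lookup itself a single vectorized match call.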
I have two strings:
x1 = "03011000"
x2 = "13001000"
The strings have an exact overlap in their non-zero characters. I would like to get for every character position the max element.
So in this case the result would be:
result = "13011000"
The first character is a 1 because x2 has a 1 at the first position. The fourth character is also a 1 because x1 has a 1 at this position.
The way I go about it is the following:
paste0(mapply(pmax, strsplit(x1, ""), strsplit(x2, "")), collapse = "")
But this seems excessive as I have to split every character into its components and compare them. Is there a better approach to this?
Using raw comparison:
rawToChar(as.raw(pmax(as.numeric(charToRaw(x1)), as.numeric(charToRaw(x2)))))
# [1] "13011000"
We could wrap it into a function:
foo <- function(x, y){
mapply(FUN = function(x, y) {
rawToChar(as.raw(pmax(as.numeric(charToRaw(x)), as.numeric(charToRaw(y)))))
}, x = x, y = y, USE.NAMES = FALSE)
}
x1 <- "03011000"
x2 <- "13001000"
foo(x1, x2)
# [1] "13011000"
x1 <- c("03011000", "ab", "123")
x2 <- c("13001000", "cd", "212")
foo(x1, x2)
# [1] "13011000" "cd" "223"
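The raw comparison works because the characters "0" to "9" have increasing byte codes, so the larger code is also the larger digit. A closely related sketch, swapping in utf8ToInt and intToUtf8 for the raw conversions and assuming both strings have the same number of characters, is:

```r
x1 <- "03011000"
x2 <- "13001000"

# Assumption: the two strings are the same length
stopifnot(nchar(x1) == nchar(x2))

# convert to code points, take the elementwise maximum, convert back
result <- intToUtf8(pmax(utf8ToInt(x1), utf8ToInt(x2)))
result
# [1] "13011000"
```

Note that utf8ToInt accepts a single string only, so for vectors of strings it would still need to be wrapped in mapply as in foo above.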
"The strings have an exact overlap in their non-zero characters."
I assume this quote from the question means that wherever both strings are nonzero, they are guaranteed to match?
If so, it is sufficient to find the positions with zeros in one vector and not in the other (with setdiff) and make the string edit:
r <- gregexpr("0", c(x1,x2))
w <- setdiff(r[[1]], r[[2]])
rr <- structure(w, match.length = rep(1L, length(w)), useBytes = TRUE)
x = x1
regmatches(x, rr) <- regmatches(x2, rr)
x
# [1] "13011000"
How can I count the number of times the specific sequence of elements x exists in the longer vector y.
x <- c(1,2,3,4,5,6)
y <- c(1,2,3,4,5,6,3,2,0,1,2,3,4,5,6,1,2,3,4,5,6,9,2,1,2,3,4,5,6,1,2,3,4)
Sorry but could not come out with any way to do it as using match will match individual elements but not a string or sequence of elements.
To add to the current answer and David's comment,
length(gregexpr(paste(x,collapse = ""), paste(y, collapse=""))[[1L]])
fails in the general case. This is because if there are no matches, the resultant index is -1, giving a length of 1 rather than the correct answer 0.
x = c(1,2,3)
y = c(4,3,2,4,3,4,3,2,2,3)
length(gregexpr(paste(x,collapse = ""), paste(y, collapse=""))[[1L]])
# [1] 1
whereas stringi::stri_count_fixed returns the expected 0.
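If staying in base R is preferred, one possible workaround is to count only the positive match positions instead of taking the length. This is a minimal sketch; the name count_matches is hypothetical, and it keeps the same assumption as the pasted approach that the elements are single characters once collapsed.

```r
# Hypothetical helper, for illustration only: gregexpr returns -1 when there
# is no match, so count the positions that are greater than 0.
count_matches <- function(x, y) {
  m <- gregexpr(paste(x, collapse = ""), paste(y, collapse = ""),
                fixed = TRUE)[[1L]]
  sum(m > 0L)
}

n_none <- count_matches(c(1, 2, 3), c(4, 3, 2, 4, 3, 4, 3, 2, 2, 3))
n_none
# [1] 0

x <- c(1, 2, 3, 4, 5, 6)
y <- c(1,2,3,4,5,6,3,2,0,1,2,3,4,5,6,1,2,3,4,5,6,9,2,1,2,3,4,5,6,1,2,3,4)
n_some <- count_matches(x, y)
n_some
# [1] 4
```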
Here is a different answer which does not rely on pasting to strings, should you prefer it. I suspect that it is slower, however:
library(zoo)
sum(rollapply(y, width = length(x), function(v) all(x == v)))
cy <- paste(y, collapse = "")
cx <- paste(x, collapse = "")
(nchar(cy) - nchar(gsub(cx, "", cy))) / nchar(cx)
# [1] 4
This is also robust to the case mentioned by @jamieRowen when the expected result is 0.
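One thing to keep in mind with the pasted approaches is that collapsing with "" can conflate values with more than one digit, for example c(1, 23) and c(12, 3) both become "123". A base R sketch that compares the numeric windows directly avoids this; the name count_seq is hypothetical.

```r
# Hypothetical helper, for illustration only: slide a window of length(x)
# over y and count the windows that equal x element by element.
count_seq <- function(x, y) {
  n <- length(x)
  m <- length(y)
  if (n == 0L || n > m) return(0L)
  starts <- seq_len(m - n + 1L)
  sum(vapply(starts, function(i) all(y[i:(i + n - 1L)] == x), logical(1)))
}

x <- c(1, 2, 3, 4, 5, 6)
y <- c(1,2,3,4,5,6,3,2,0,1,2,3,4,5,6,1,2,3,4,5,6,9,2,1,2,3,4,5,6,1,2,3,4)
n_hits <- count_seq(x, y)
n_hits
# [1] 4

# multi-digit values are not confused with each other
n_multi <- count_seq(c(1, 23), c(12, 3))
n_multi
# [1] 0
```

Unlike the gsub and gregexpr versions, this also counts overlapping occurrences, which may or may not be what is wanted.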
I have a data frame with two string variables with an equal number of characters. These strings represent a student's responses for some exam. The first string contains a + sign for each question answered correctly and the incorrect response for each incorrect item. The second string contains all the correct answers. I want to replace all the + signs in the first string with the correct answer from the second string. A simplified heuristic data set can be created with this code:
df <- data.frame(v1 = c("+AA+B", "D++CC", "A+BAD"),
v2 = c("DBBAD", "BDCAD","CDCCA"), stringsAsFactors = FALSE)
So the + signs in df$v1 need to be replaced with the letters in df$v2 that are the same distance from the start of the string. Any ideas?
When df$v1 and df$v2 are character vectors we may use
regmatches(df$v1, gregexpr("\\+", df$v1)) <- regmatches(df$v2, gregexpr("\\+", df$v1))
That is,
df <- data.frame(v1 = c("+AA+B", "D++CC", "A+BAD"),
v2 = c("DBBAD", "BDCAD", "CDCCA"),
stringsAsFactors = FALSE)
rg <- gregexpr("\\+", df$v1)
regmatches(df$v1, rg) <- regmatches(df$v2, rg)
df
# v1 v2
# 1 DAAAB DBBAD
# 2 DDCCC BDCAD
# 3 ADBAD CDCCA
rg contains the positions of "+" in df$v1, and we conveniently exploit regmatches to replace those matches in df$v1 with whatever is in df$v2 at the same positions.
This one seems valid, too:
mapply(function(x, y) paste0(ifelse(x == "+", y, x), collapse = ""),
strsplit(as.character(df$v1), ""), strsplit(as.character(df$v2), ""))
#[1] "DAAAB" "DDCCC" "ADBAD"
Based on Tyler Rinker's answer, conceptually it's the same, but using just one lapply and ifelse.
> dats <- lapply(df, function(x) do.call(rbind, strsplit(as.character(x), "")))
> apply(with(dats, ifelse(v1=="+", v2, v1)), 1, paste0, collapse="")
[1] "DAAAB" "DDCCC" "ADBAD"
Most likely there's a better approach but here's one where I make the two columns into matrices and then a lookup key:
## df<-data.frame(v1 = c("+AA+B", "D++CC", "A+BAD"), v2 = c("DBBAD", "BDCAD","CDCCA"))
dats <- lapply(df, function(x) do.call(rbind, strsplit(as.character(x), "")))
dats[[1]][dats[[1]] == "+"] <- dats[[2]][dats[[1]] == "+"]
apply(dats[[1]], 1, paste, collapse = "")
## [1] "DAAAB" "DDCCC" "ADBAD"
I thought this one may be an interesting one to benchmark:
Unit: microseconds
     expr     min      lq  median       uq      max neval
 Andrea() 296.693 313.953 321.884 328.4155 2443.051  1000
   Josh() 300.891 314.420 319.551 326.5500 3748.779  1000
  Tyler() 144.148 155.344 159.543 164.2080 2233.593  1000
 Jibler() 174.937 188.932 193.597 198.7290 2269.514  1000
 Alexis() 154.877 167.007 171.672 175.4040 2342.753  1000
 Julius() 394.658 413.317 420.315 429.4120 2549.412  1000
df<-data.frame(v1 = c("+AA+B", "D++CC", "A+BAD"),
v2 = c("DBBAD", "BDCAD","CDCCA"),
stringsAsFactors = FALSE)
f <- function(x , y){
xs <- unlist(strsplit(x, split = ""))
ys <- unlist(strsplit(y, split = ""))
paste(ifelse(xs == "+", ys , xs), collapse = "")
}
# pair the i-th element of v1 with the i-th element of v2 (vapply would pass all of df$v2 to every call)
mapply(f, df$v1, df$v2, USE.NAMES = FALSE)