Eliminating partially overlapping parts of 2 vectors in R

I wonder if it might be possible to drop the parts of the n1 character vector that partially overlap with elements in the f1 formula.
For example, in n1, "study_typecompare" and "study_typecontrol" partially overlap with study_type in f1.
Thus, in the desired_output, we want to drop the "study_type" part of them. Because other elements (e.g. time_wk_whn) in n1 fully match an element in f1, we leave them unchanged.
Is obtaining my desired_output possible in base R or the tidyverse?
f1 <- gi ~ 0 + study_type + time_wk_whn + time_wk_btw + items_whn +
  items_btw + training_hr_whn + training_hr_btw
n1 <- c("study_typecompare", "study_typecontrol", "time_wk_whn",
        "time_wk_btw", "items_whn", "items_btw", "training_hr_whn",
        "training_hr_btw")
desired_output <- c("compare", "control", "time_wk_whn",
                    "time_wk_btw", "items_whn", "items_btw",
                    "training_hr_whn", "training_hr_btw")

We create a function that takes the formula and the vector ('fmla', 'vec'). It extracts the variables from 'fmla' (all.vars), finds the values in 'vec' that are not found among the formula variables (setdiff), builds a pattern by pasting those variables together, removes the matched part with sub (replacing it with ""), writes the result back into 'vec', and returns the updated vector.
fun1 <- function(fmla, vec) {
  v1 <- all.vars(fmla)                          # variables appearing in the formula
  v2 <- setdiff(vec, v1)                        # entries that are not exact matches
  v3 <- sub(paste(v1, collapse = "|"), "", v2)  # strip the overlapping variable name
  vec[vec %in% v2] <- v3
  vec
}
Checking:
> identical(fun1(f1, n1), desired_output)
[1] TRUE
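For completeness, a roughly equivalent tidyverse-flavoured sketch (assuming the stringr package is available and that the formula variables contain no regex metacharacters; the ifelse guard is my own illustration, not part of the answer above):
library(stringr)
pat <- paste(all.vars(f1), collapse = "|")                     # formula variables as an alternation
out <- ifelse(n1 %in% all.vars(f1), n1, str_remove(n1, pat))   # keep exact matches untouched
identical(out, desired_output)
## expected: TRUE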

Related

R progressively search string of a hierarchical lookup table for matches

I am working with OPS codes, which encode the type of procedure performed in a hospital. The OPS coding list has a hierarchical structure of the form X-XXX.XX, with X being digits. The structure is hierarchical: the leading X- is a broad set, the XXX denotes a subset type of procedure within that X-, and the trailing .XX denotes a subspecialization of the XXX.
So a code might look like X-XXX, X-XXX., X-XXX.X, or X-XXX.XX.
My problem is that a program we use collapses the structure of the code to XXXX, XXXXX, or XXXXXXX, and I would like to match the collapsed codes against the uncollapsed lookup table of definitions.
So I would like a routine that checks each digit and then proceeds to the next when performing the matching. grepl will not do, because 5381 would match 65381 (the uncollapsed codes would be 5-381 and 6-538.1), which are totally different procedures. I need something that matches character by character (first digit, second digit, etc.) and respects the character positions.
When an exact match cannot be found, it should return the first match that agrees on the available character positions.
More examples in pseudocode:
which("5381" %in% c("65381","53811", "5382")) should return 2, since the second item matches all of the available characters provided.
which("5381" %in% c("538110","538111", "538221")) should return 1 (because it's the first match; the lookup table within c() is sorted).
which("5381." %in% c("5381","538111", "538121")) should return 1 (because it's the first match; the lookup table within c() is sorted). Note that the period is ignored in the match.
which("5381.1" %in% c("5381","538111", "538112")) should return 2 (because it's the first match that matches all five available characters; "5381" doesn't have a fifth).
I know this is not the best example of an SO question, but I am open to improving it.
This is probably too complicated but it works.
First define a generic that transforms the input string to the OPS format. Then have a matching function check whether x has y as a substring.
Note that the matching function does not check whether x is a substring of y; it is the other way around.
as.ops <- function(x, ...) UseMethod("as.ops")

as.ops.default <- function(x, ...) {
  warning("The default method coerces its argument to character and calls the character method")
  as.ops.character(as.character(x))
}

as.ops.character <- function(x, ...) {
  x <- gsub("[^[:digit:]]", "", x)   # keep the digits only
  ops1 <- substr(x, 1, 1)            # leading digit (the "X-" part)
  ops2 <- substr(x, 2, 4)            # middle block ("XXX")
  ops3 <- substring(x, 5)            # optional suffix (".XX")
  y <- character(length(x))
  n <- findInterval(nchar(x), c(0, 1, 4, 7))   # bucket inputs by digit count
  y[n == 1] <- x[n == 1]                       # no digits left: keep the empty string
  y[n != 1] <- paste(ops1[n != 1], ops2[n != 1], sep = "-")
  o3 <- nchar(ops3) > 0
  y[n == 3 & o3] <- paste(y[n == 3 & o3], ops3[n == 3 & o3], sep = ".")
  y
}
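For orientation, a small sanity check of the transformation; the output shown is what the definition above should produce, traced by hand rather than verified:
as.ops(c("65381", "5381", "538110"))
## "6-538.1" "5-381" "5-381.10" (expected)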
ops_match <- function(x, y) {
  xo <- as.ops(x)
  yo <- as.ops(y)
  # keep candidates whose OPS form equals, or literally contains, the OPS form of the query
  # (fixed = TRUE so the "." is matched literally)
  i <- (xo %in% yo) | grepl(yo, xo, fixed = TRUE)
  which(i)
}
x1 <- c("65381","53811", "5382")
x2 <- c("538110","538111", "538221")
x3 <- c("5381","538111", "538121")
x4 <- c("5381","538111", "538112")
y1 <- y2 <- "5381"
y3 <- "5381."
y4 <- "5381.1"
ops_match(x1, y1)
ops_match(x2, y2)
ops_match(x3, y3)
ops_match(x4, y4)
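Since which(i) returns every index whose OPS form contains the query, a caller who wants only the first hit (as in the pseudocode above) could wrap the result; this ops_match_first helper is my own sketch, not part of the answer:
ops_match_first <- function(x, y) {
  hits <- ops_match(x, y)
  if (length(hits)) hits[1] else NA_integer_   # first matching position, or NA if none
}
ops_match_first(x2, y2)
## expected: 1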

Lookup list of formulas in other list

I am comparing two lists of formulas to see if some previously computed models can be reused. Right now I'm doing this like this:
set.seed(123)
# create some random formulas
l1 <- l2 <- list()
for (i in 1:10) {
  l1[[i]] <- as.formula(paste("z ~", paste(sample(letters, 3), collapse = " + ")))
  l2[[i]] <- as.formula(paste("z ~", paste(sample(letters, 3), collapse = " + ")))
}
# at least one appears in the other list
l1[[5]] <- l2[[7]]
# helper function to convert formulas to character strings
as.formulaCharacter <- function(x) paste(deparse(x))
# convert both lists to strings
s1 <- sapply(l1, as.formulaCharacter)
s2 <- sapply(l2, as.formulaCharacter)
# look up elements of one vector in the other
idx <- match(s1, s2, nomatch = 0L) # 7
s1[idx] # found matching elements
However, I noticed that some formulas are not retrieved although they are practically equivalent.
f1 <- z ~ b + c + b:c
f2 <- z ~ c + b + c:b
match(as.formulaCharacter(f1), as.formulaCharacter(f2)) # no match
I get why this result is different, the strings just aren't the same, but I'm struggling with how to extend this approach so it also works for formulas with reordered elements. I could use strsplit to first sort all formula components independently, but that sounds horribly inefficient to me.
Any ideas?
If the formulas are restricted to a sum of terms containing colon-separated variables, then we can create a standardized string by extracting the term labels, exploding those on the colons, sorting them, pasting the exploded terms back together, sorting the result, and turning that into a formula string.
stdize <- function(fo) {
  s <- strsplit(attr(terms(fo), "term.labels"), ":")            # split each term on ":"
  labs <- sort(sapply(lapply(s, sort), paste, collapse = ":"))  # sort within and across terms
  format(reformulate(labs, all.vars(fo)[1]))
}
stdize(f1) == stdize(f2)
## [1] TRUE
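To apply this to the two lists from the question, one could standardize every formula before matching; a minimal sketch assuming the l1 and l2 objects created above:
std1 <- sapply(l1, stdize)
std2 <- sapply(l2, stdize)
match(std1, std2, nomatch = 0L)   # non-zero entries mark formulas of l1 found in l2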

Converting a vector into formula

Given a data.frame and a vector containing only -1, 0, and 1, with length equal to the number of columns of the data.frame: is there a natural way to transform the vector into a formula in which the columns whose positions hold -1 appear on the left side of the formula and those holding +1 appear on the right side?
For example, given the following data.frame
df <- data.frame(
  'a' = rnorm(10),
  'b' = rnorm(10),
  'c' = rnorm(10),
  'd' = rnorm(10),
  'e' = rnorm(10))
and the following vector vec = c(-1, -1, 0, 1, 1).
Is there a natural way to build the formula a+b~d+e?
We assume that if there are no 1's in vec then we should use a right-hand side of 1, and that if there are no -1's in vec then the left-hand side is empty.
The alternatives each produce a character string, but if a formula class object is wanted use formula(s), where s is that string.
1) paste each side. Subset out the names corresponding to -1 in vec, giving the LHS, and paste/collapse them; do the same with 1, giving the RHS; then paste the two together with ~. If we knew that there were at least one 1 in vec we could omit the if statement. Of the solutions here this seems the most straightforward.
nms <- names(df)
LHS <- paste(nms[vec == -1], collapse = "+")
RHS <- paste(nms[vec == 1], collapse = "+")
if (RHS == "") RHS <- "1"
paste0(LHS, "~", RHS)
## [1] "a+b~d+e"
2) sapply. Alternatively, combine the LHS and RHS lines into a single sapply. If we knew that there were at least one 1 in vec then we could simplify the code by omitting the if statement. This approach is shorter than (1).
sa <- sapply(c(-1, 1), function(x) paste(names(df)[vec == x], collapse = "+"))
if (sa[2] == "") sa[2] <- "1"
paste0(sa[1], "~", sa[2])
## [1] "a+b~d+e"
3) tapply. We can alternatively combine the LHS and RHS lines into a single tapply like this:
ta <- tapply(names(df), vec, paste, collapse = "+")
paste0(if (any(vec == -1)) ta[["-1"]], "~", if (any(vec == 1)) ta[["1"]] else 1)
## [1] "a+b~d+e"
If we knew that -1 and 1 each appear at least once in vec then we can simplify the last line to:
paste0(ta[["-1"]], "~", ta[["1"]]])
## [1] "a+b~d+e"
Overall this approach is the shortest if we can guarantee that there will be at least one 1 and at least one -1 but otherwise handling the edge cases seems somewhat cumbersome compared to the other approaches.
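As noted at the top of this answer, each alternative yields a character string, and a formula object is then just a formula() call away; a minimal check assuming LHS and RHS from alternative (1) are still in the workspace:
s <- paste0(LHS, "~", RHS)   # string built in (1)
formula(s)
## a + b ~ d + e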
We could also do this by grouping the names by vec with aggregate and pasting:
paste(aggregate(nm ~ vec,
                subset(data.frame(nm = names(df), vec, stringsAsFactors = FALSE), vec != 0),
                FUN = paste, collapse = ' + ')[['nm']],
      collapse = ' ~ ')
#[1] "a + b ~ d + e"
Or another option is tapply
paste(tapply(names(df), vec, FUN = paste, collapse = ' + ')[c('-1', '1')],
      collapse = ' ~ ')
#[1] "a + b ~ d + e"

Minimal number of coverage of big data lists

Following my question, I use the following code:
dist<-c('att1','att2','att3','att4','att5','att6')
p1<-c('att1','att5','att2')
p2<-c('att5','att1','att4')
p3<-c('att3','att4','att2')
p4<-c('att1','att2','att3')
p5<-c('att6')
....
p32<-c('att35','att34','att32')
In the real case there can be 1024 vectors.
I would like to find all the relevant p vectors whose union covers the maximal number of components of dist. In this case the solution would be p1, p3, p5. I want to choose the minimal number of p vectors. In addition, in case there is no way to cover all of the dist components, I want to choose the maximal cover with the minimal number of vectors (p).
N = 32
library(qdapTools)
library(dplyr)
library(data.table)
## generate matrix of attributes
attribute_matrix <- mtabulate(list(p1, p2, p3, p4, p5,...,p32))
library(bigmemory)
## generate the grid of all 0/1 combinations of the N p-vectors
grid_matrix <- do.call(CJ, rep(list(1:0), N)) %>% as.big.matrix
Error: cannot allocate vector of size 8.0 Gb
I tried an alternative way for it:
grid_matrix <- do.call(CJ, rep(list(1:0), N)) %>% as.data.frame
grid_matrix <- as.matrix(grid_matrix)
And still got the same error.
How can I fix it and use it for big data? I wanted to continue with:
colnames(grid_matrix) <- paste0("p", 1:N)
combin_all_element_present <- rowSums(grid_matrix %*% attribute_matrix > 0) %>% `==`(., ncol(attribute_matrix))
grid_matrix_sub <- grid_matrix[combin_all_element_present, ]
grid_matrix_sub[rowSums(grid_matrix_sub) == min(rowSums(grid_matrix_sub)), ]
This is known as a set covering problem. It can be solved using integer linear programming. Let x1, x2, ... be 0/1 variables (one for each p variable) and represent p1, p2, ... as 0/1 vectors P1, P2, ... and dist as
a 0/1 vector D. Then the problem can be stated as:
min x1 + x2 + ... + x32
such that
P1 * x1 + P2 * x2 + ... + P32 * x32 >= D
which in R code is the following. First create a list p with the p vectors in sorted order. Use mixedsort so that p32 comes at the end instead of right after p3. Define attnames as the set of all att names in all the p vectors.
Then formulate the objective function (which equals the number of p's in the cover), the constraint matrix (consisting of the P vectors as columns) and the right hand side of the constraint equations (which is dist as a 0/1 vector). Finally run the integer linear program and convert the solution from a 0/1 vector to a vector of p names.
library(gtools)
library(lpSolve)
p <- mget(mixedsort(ls(pattern = "^p\\d+$")))
attnames <- mixedsort(unique(unlist(p)))
objective <- rep(1L, length(p))
const.mat <- sapply(p, function(x) attnames %in% x) + 0L
const.rhs <- (attnames %in% dist) + 0L
ans <- lp("min", objective, const.mat, ">=", const.rhs, all.bin = TRUE)
names(p)[ans$solution == 1L]
## [1] "p2" "p4" "p5"
The constraint matrix has a row for each attnames entry and a column for each p vector.
The solution produces the minimal covers of those attnames elements that are in dist. If every element of dist appears in at least one p vector then the solution will represent a cover of dist. If not, the solution will represent a cover of those att names in one or more p vectors that are also in dist; thus, this handles both cases discussed in the question. The uncovered elements of dist are:
setdiff(dist, attnames)
so if that is of zero length then the solution represents a complete cover of dist. If not the solution represents a cover of
intersect(dist, attnames)
The sorting done in the code is not strictly needed, but it may be easier to work with the various inputs to the optimization by having the rows and columns of the constraint matrix in a logical order.
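As a small follow-up sketch of my own (assuming p, ans, and dist from the code above are in the workspace), the cover can also be confirmed directly from the chosen sets:
covered <- unique(unlist(p[ans$solution == 1L]))   # attributes covered by the selected p vectors
setdiff(dist, covered)                             # character(0) here, i.e. dist is fully covered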
Note: Run this code from the question before running the above code:
dist<-c('att1','att2','att3','att4','att5','att6')
p1<-c('att1','att5','att2')
p2<-c('att5','att1','att4')
p3<-c('att3','att4','att2')
p4<-c('att1','att2','att3')
p5<-c('att6')
p32<-c('att35','att34','att32')
The answer already provided is perfect but another approach could be the following:
dist<-c('att1','att2','att3','att4','att5','att6')
p1<-c('att1','att5','att2')
p2<-c('att5','att1','att4')
p3<-c('att3','att4','att2')
p4<-c('att1','att2','att3')
p5<-c('att6')
library(qdapTools)
library(data.table)
attribute_matrix <- mtabulate(list(p1, p2, p3, p4, p5))
minimal_sets <- function(superset, subsets_matrix, p){
  setDT(subsets_matrix)

  # removing the columns that are not in the superset
  updated_sub_matr <- subsets_matrix[, which(names(subsets_matrix) %in% superset), with = FALSE]

  # initializing the iteration counter and the subsets selected
  subset_selected <- integer(0)
  counter <- p

  ## loop until either we run out of iterations (counter == 0) or we find a solution
  while (counter > 0 & length(superset) > 0){

    ## find the row with the most matches with the superset we want to achieve
    max_index <- which.max(rowSums(updated_sub_matr))

    ## remove from the superset the entries covered by that row, and drop those
    ## columns from the subsets matrix as they don't contribute anymore
    superset <- superset[which(updated_sub_matr[max_index, ] == 0)]
    updated_sub_matr <- updated_sub_matr[, -which(updated_sub_matr[max_index, ] != 0), with = FALSE]

    counter <- counter - 1
    subset_selected <- c(subset_selected, max_index)
  }

  if (length(superset) > 0){
    print(paste0("No solution found, there are(is) ", length(superset),
                 " element(s) left ", paste(superset, collapse = "-")))
  } else {
    print(paste0("Found a solution after ", p - counter, " iterations"))
  }
  print(paste0("Selected the following subsets: ", paste(subset_selected, collapse = "-")))
}
This function takes your superset (in this case dist), the attribute_matrix, and the number of iterations p you want to allow; it prints the best solution it found as well as the number of iterations used.
> minimal_sets(dist, attribute_matrix, 1)
[1] "No solution found, there are(is) 3 element(s) left att3-att4-att6"
[1] "Selected the following subsets: 1"
> minimal_sets(dist, attribute_matrix, 3)
[1] "Found a solution after 3 iterations"
[1] "Selected the following subsets: 1-3-5"
> minimal_sets(dist, attribute_matrix, 5)
[1] "Found a solution after 3 iterations"
[1] "Selected the following subsets: 1-3-5

Calculating the percentage of matching elements between several vectors in R

I have 5 vectors of strings, and each vector has a different number of elements; however, many elements are common across the vectors. For example:
v1 <- c("a","x","y","z")
v2 <- c("b","g","m","r","s","x","z")
v3 <- c("a","m","x","y","z","b","r","g")
v4 <- c("d","h","a","g","s","x")
v5 <- c("a","b","m","x","y","z")
I want to calculate the percentage of matches between all the vectors, based on the number of matching elements. I do not want to compare by element order, so each element of one vector has to be checked against every element of every other vector. Here the best match is between v1 and v5: we can say that v1 and v5 match (8/10)*100 = 80%. I want all pairs of vectors with a percentage higher than 50%.
An easy implementation would be to compare all combinations of two vectors. You can then use intersect to find the number of common values.
require(caTools)
comb <- combs(c("v1","v2","v3","v4","v5"), 2)
for (i in 1:nrow(comb)) {
  a <- eval(parse(text = comb[i, 1]))
  b <- eval(parse(text = comb[i, 2]))
  prct <- 2 * length(intersect(a, b)) / (length(a) + length(b))
  cat("\nMatching between", comb[i, 1], "and", comb[i, 2], "is", prct)
}
(Here prct is calculated as I think you've described in your example with v1 and v5)
Note that you can also do this using two nested for-loops, but I find combs easier to use to avoid duplicate combinations.
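For reference, a base R sketch of the same idea using combn() instead of caTools::combs(); the vs list and the 0.5 cutoff are my own illustration:
vs <- list(v1 = v1, v2 = v2, v3 = v3, v4 = v4, v5 = v5)
pairs <- combn(names(vs), 2)                 # all unordered pairs of vector names
prct <- apply(pairs, 2, function(nm) {
  a <- vs[[nm[1]]]
  b <- vs[[nm[2]]]
  2 * length(intersect(a, b)) / (length(a) + length(b))
})
data.frame(first = pairs[1, ], second = pairs[2, ], prct)[prct > 0.5, ]   # keep pairs above 50%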
I used the info here and HERE to write the function below; just input your data frame and column numbers.
# x = data frame // y = column number in data for string 1 // z = column number in data for string 2
string_matcher <- function(x, y, z) {
  data <- x
  char.x <- as.matrix(strsplit(as.character(data[, y]), ""))
  char.y <- as.matrix(strsplit(as.character(data[, z]), ""))
  stored_vector <- as.matrix(sapply(1:nrow(data), function(i)
    2 * length(intersect(char.x[[i]], char.y[[i]])) /
      (length(char.x[[i]]) + length(char.y[[i]]))))
  return(stored_vector)
}
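A hypothetical usage example of string_matcher (the dat data frame is made up for illustration); it returns, per row, twice the number of distinct shared characters divided by the combined length of the two strings:
dat <- data.frame(s1 = c("kitten", "sitting"),
                  s2 = c("mitten", "sitter"),
                  stringsAsFactors = FALSE)
string_matcher(dat, 1, 2)   # per-row character-overlap proportion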
