collapse / compress vector of repeated elements to max k-repeated - r

Is there a more efficient way than the rle-based function below to compress/collapse a vector of, let's say, strings so that each run of repeated elements is at most k long? Example input and desired outputs are given below.
Input
foov <- rep(c("a", "b", "a"), c(5, 3, 2))
For k = 2, desired output would be:
"a" "a" "b" "b" "a" "a"
And for k = 3, desired output would be:
"a" "a" "a" "b" "b" "b" "a" "a"
At the moment I am using rle as follows to achieve this:
collapseRLE <- function(v, k) {
vrle <- rle(v)
vrle$lengths[vrle$lengths > k] <- k
ret <- rep(vrle$values, vrle$lengths)
return(invisible(ret))
}
foov <- rep(c("a", "b", "a"), c(5, 3, 2))
print(collapseRLE(foov, 2))

We can use rleid from data.table. Grouping by rleid(vec) puts each run of identical values in its own group; within each group we take the first min(k, .N) elements and then extract the resulting column as a vector ($V1).
library(data.table)
f1 <- function(k, vec) data.table(vec)[, vec[seq_len(pmin(k, .N))], rleid(vec)]$V1
f1(2, foov)
#[1] "a" "a" "b" "b" "a" "a"
f1(3, foov)
#[1] "a" "a" "a" "b" "b" "b" "a" "a"
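A base R variant, offered as a sketch rather than a benchmarked improvement: sequence() applied to the run lengths numbers each element within its run, so keeping the positions numbered at most k caps every run at k. The name collapse_seq is made up for this illustration.

```r
# Hypothetical helper (name chosen for illustration): cap each run at k elements.
collapse_seq <- function(v, k) {
  # position of each element within its run, e.g. 1 2 3 4 5 1 2 3 1 2
  pos_in_run <- sequence(rle(v)$lengths)
  v[pos_in_run <= k]
}

foov <- rep(c("a", "b", "a"), c(5, 3, 2))
collapse_seq(foov, 2)
#[1] "a" "a" "b" "b" "a" "a"
```

Whether this is faster than the rle/rep or rleid versions depends on the data, so it is worth timing on a realistic vector.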


Generate all combinations (and their sum) of a vector of characters in R

Suppose that I have a vector of length n and I need to generate all possible combinations and their sums. For example:
If n=3, we have:
myVec <- c("a", "b", "c")
Output =
"a"
"b"
"c"
"a+b"
"a+c"
"b+c"
"a+b+c"
Note that we consider that a+b = b+a, so only need to keep one.
Another example if n=4,
myVec <- c("a", "b", "c", "d")
Output:
"a"
"b"
"c"
"d"
"a+b"
"a+c"
"a+d"
"b+c"
"b+d"
"c+d"
"a+b+c"
"a+b+d"
"a+c+d"
"b+c+d"
"a+b+c+d"
We can use sapply over the combination sizes 1 to n, calling combn for each size with paste as the function applied to every combination.
sapply(seq_along(myVec), function(n) combn(myVec, n, paste, collapse = "+"))
#[[1]]
#[1] "a" "b" "c"
#[[2]]
#[1] "a+b" "a+c" "b+c"
#[[3]]
#[1] "a+b+c"
myVec <- c("a", "b", "c", "d")
sapply(seq_along(myVec), function(n) combn(myVec, n, paste, collapse = "+"))
#[[1]]
#[1] "a" "b" "c" "d"
#[[2]]
#[1] "a+b" "a+c" "a+d" "b+c" "b+d" "c+d"
#[[3]]
#[1] "a+b+c" "a+b+d" "a+c+d" "b+c+d"
#[[4]]
#[1] "a+b+c+d"
We can unlist if we need output as single vector.
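As a minimal sketch of that last step, the per-size results can be flattened with unlist. lapply is used here instead of sapply only so that the intermediate result is always a list; the variable name all_combos is illustrative.

```r
myVec <- c("a", "b", "c")

# one character vector per combination size, then flattened into a single vector
all_combos <- unlist(lapply(seq_along(myVec), function(n) {
  combn(myVec, n, paste, collapse = "+")
}))
all_combos
#[1] "a"     "b"     "c"     "a+b"   "a+c"   "b+c"   "a+b+c"
```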

Trouble evaluating combinations from combn using purrr

I am trying to use combn to divide a group of n = 20 different units into 3 groups of unequal size -- 4, 6 and 10. Then I am trying to validate for values that must be together within a group -- if one element from the pair exists in the group then the other should also be in the group. If one is not in the group then neither should be in the group. In this fashion, I'd like to evaluate the groups in order to find all possible valid solutions where the rules are true.
library(purrr)
x <- letters[1:20]
same_group <- list(
c("a", "c"),
c("d", "f"),
c("b", "k", "r")
)
combinations_list <- combn(x, 4, simplify = FALSE)
validate_combinations <- function(x) all(c("a", "c") %in% x) | !any(c("a", "c") %in% x)
valid_combinations <- keep(combinations_list, validate_combinations)
In this way I'd like to combine -> reduce each group until I have a list of all valid combinations. I'm not sure how to combine combinations_list, validate_combinations, and the same_group to check all same_group "rules" against the combinations in the table. The furthest I can get is to check against one combination c("a", "c"), which when run against keep(combinations_list, validate_combinations) is indeed giving me the output I want.
I think once I can do this, I can then use the unpicked values in another combn function for the group of 6 and the group of 10.
We can change the function to accept the group as an argument:
validate_combinations <- function(x, group) all(group %in% x) | !any(group %in% x)
Then, for each group in same_group, subset combinations_list to the combinations that satisfy validate_combinations:
lapply(same_group, function(x) combinations_list[
sapply(combinations_list, function(y) validate_combinations(y, x))])
#[[1]]
#[[1]][[1]]
#[1] "a" "b" "c" "d"
#[[1]][[2]]
#[1] "a" "b" "c" "e"
#[[1]][[3]]
#[1] "a" "b" "c" "f"
#[[1]][[4]]
#[1] "a" "b" "c" "g"
#[[1]][[5]]
#[1] "a" "b" "c" "h"
#[[1]][[6]]
#[1] "a" "b" "c" "i"
#[[1]][[7]]
#[1] "a" "b" "c" "j"
#[[1]][[8]]
#[1] "a" "b" "c" "k"
#......
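The result above is one list of combinations per rule. If the goal is the combinations that satisfy every rule in same_group at once, a sketch along these lines may help; satisfies_all and valid_all are names invented for this example, and base Filter is used in place of purrr::keep so the snippet has no package dependency.

```r
x <- letters[1:20]
same_group <- list(c("a", "c"), c("d", "f"), c("b", "k", "r"))
combinations_list <- combn(x, 4, simplify = FALSE)

# a group is respected when it is entirely inside or entirely outside the combination
validate_combinations <- function(combo, group) {
  all(group %in% combo) || !any(group %in% combo)
}

# hypothetical helper: TRUE only when every rule in same_group holds for combo
satisfies_all <- function(combo) {
  all(vapply(same_group,
             function(g) validate_combinations(combo, g),
             logical(1)))
}

valid_all <- Filter(satisfies_all, combinations_list)
length(valid_all)
```

The same filter could then be applied to the combinations drawn from the remaining letters for the groups of 6 and 10.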

Return all elements of list containing certain strings

I have a list of vectors containing strings and I want R to give me another list with all vectors that contain certain strings. MWE:
list1 <- list("a", c("a", "b"), c("a", "b", "c"))
Now, I want a list that contains all vectors with "a" and "b" in it. Thus, the new list should contain two elements, c("a", "b") and c("a", "b", "c").
As list1[grep("a|b", list1)] gives me a list of all vectors containing either "a" or "b", I expected list1[grep("a&b", list1)] to do what I want, but it did not (it returned a list of length 0).
This should work:
test <- list("a", c("a", "b"), c("a", "b", "c"))
test[sapply(test, function(x) sum(c('a', 'b') %in% x) == 2)]
Try purrr::keep:
library(purrr)
keep(list1, ~ all(c("a", "b") %in% .))
We can use Filter:
Filter(function(x) all(c('a', 'b') %in% x), test)
#[[1]]
#[1] "a" "b"
#[[2]]
#[1] "a" "b" "c"
A solution with grepl:
list1[grepl("a", list1) & grepl("b", list1)]
#[[1]]
#[1] "a" "b"
#[[2]]
#[1] "a" "b" "c"
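One caveat worth illustrating: grepl matches the patterns against the deparsed text of each list element, so it also accepts elements that merely contain the letters as substrings, whereas %in% compares whole strings. The list2 data below is invented for this sketch.

```r
# made-up example data: "ab" is a single string, not the two strings "a" and "b"
list2 <- list("ab", c("a", "b"), c("b", "c"))

# substring matching: keeps "ab" as well as c("a", "b")
by_grepl <- list2[grepl("a", list2) & grepl("b", list2)]

# exact element matching: keeps only c("a", "b")
by_in <- Filter(function(x) all(c("a", "b") %in% x), list2)

length(by_grepl)
#[1] 2
length(by_in)
#[1] 1
```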

R: Non-greedy version of setdiff?

Here's setdiff normal behaviour:
x <- rep(letters[1:4], 2)
x
# [1] "a" "b" "c" "d" "a" "b" "c" "d"
y <- letters[1:2]
y
# [1] "a" "b"
setdiff(x, y)
# [1] "c" "d"
… but what if I want y to be taken out only once, and therefore get the following result?
# "c" "d" "a" "b" "c" "d"
I'm guessing that there is an easy solution using either setdiff or %in%, but I just cannot see it.
match returns a vector of the positions of (first) matches of its first argument in its second. It's used as an index constructor:
x[ -match(y,x) ]
#[1] "c" "d" "a" "b" "c" "d"
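Note that negative indexing fails when match returns NA (an element of y that is absent from x), and that x[-integer(0)] returns an empty vector rather than x. A defensive sketch, with remove_once as a name made up for this example:

```r
# Hypothetical helper: drop the first occurrence in x of each distinct value in y.
remove_once <- function(x, y) {
  idx <- match(y, x)
  idx <- idx[!is.na(idx)]          # ignore values of y that never occur in x
  if (length(idx) == 0L) return(x) # nothing to remove
  x[-idx]
}

x <- rep(letters[1:4], 2)
remove_once(x, c("a", "b"))
#[1] "c" "d" "a" "b" "c" "d"
remove_once(x, c("a", "z"))
#[1] "b" "c" "d" "a" "b" "c" "d"
```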
If there are duplicates in 'y' and you want removal in proportion to their numbers therein, then the first thing that came to my mind is a for-loop:
y <- c("a","b","a")
x2 <- x
for( i in seq_along(y) ){ x2 <- x2[-match(y[i],x2)] }
x2
#[1] "c" "d" "b" "c" "d"
The tabling approach below expresses the same result as counts per value. It uses some "set" functions, although this is not really a set problem, and it seems somewhat more "vectorised":
c( table(x [x %in% intersect(x,y)]) - table(y[y %in% intersect(x,y)]) ,
table( x[!x %in% intersect(x,y)]) )
# a b c d
# 0 1 2 2
The vecsets package has a vsetdiff function for this.
x <- rep(letters[1:4], 2)
y <- letters[1:2]
vecsets::vsetdiff(x, y)
#[1] "c" "d" "a" "b" "c" "d"
Here is another looping method. I think 42's method is cleaner, but it provides another option.
# count how often each value in union(x, y) occurs in y
myCounts <- table(factor(y, levels=sort(union(x, y))))
# extract these elements from x
x[-unlist(lapply(names(myCounts),
function(i) which(i == x)[seq_len(myCounts[i])]))]
The "non-greedy" aspect comes from [seq_len(myCounts[i])], which takes only as many matching elements of x as there are copies of that value in y.
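A loop-free sketch of the same idea, assuming only base R: tag every element with its occurrence number (first "a", second "a", and so on) and drop the elements of x whose tag also appears in y. The helper name occurrence_key is invented for this illustration.

```r
# Hypothetical helper: label each element with its running occurrence count,
# e.g. c("a", "b", "a") becomes "a_1" "b_1" "a_2"
occurrence_key <- function(v) {
  paste(v, ave(seq_along(v), v, FUN = seq_along), sep = "_")
}

x <- rep(letters[1:4], 2)
y <- c("a", "b", "a")

# keep the elements of x whose (value, occurrence) pair is not used up by y
x[!occurrence_key(x) %in% occurrence_key(y)]
#[1] "c" "d" "b" "c" "d"
```

With y <- letters[1:2] the same expression gives the result asked for in the question, because only the first "a" and the first "b" are removed.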

generate labels for variables in R

I'm searching for a better/faster way than this one to generate labels for a variable:
df <- data.frame(a=c(0,7,1,10,2,4,3,5,10,1,7,8,3,2))
pick <- c(0,1,2,3,10)
df[sapply(df$a,function(x) !(x %in% pick)),"a"] <- "a"
df[sapply(df$a,function(x) x==0),"a"] <- "b"
df[sapply(df$a,function(x) x==1 | x==2 | x==3),"a"] <- "c"
df[sapply(df$a,function(x) x==10),"a"] <- "d"
df$a
# [1] "b" "a" "c" "d" "c" "a" "c" "a" "d" "c" "a" "a" "c" "c"
For simplicity, I have just one variable in this example. There are more variables in my dataset, but I only want to change a specific one.
You don't need sapply:
df$a[!df$a %in% pick] <- "a"
df$a[df$a==0] <- "b"
df$a[df$a %in% 1:3] <- "c"
df$a[df$a==10] <- "d"
You could also produce the same result with factors:
df <- data.frame(a=c(0,7,1,10,2,4,3,5,10,1,7,8,3,2))
# the above method
a <- df$a
a[!df$a %in% pick] <- "a"
a[df$a==0] <- "b"
a[df$a %in% 1:3] <- "c"
a[df$a==10] <- "d"
# one way that gives a warning
b1 <- factor(df$a, levels=0:10, labels=c("b",rep("c",3),rep("a",6),"d"))
# another way that won't give a warning
b2 <- factor(df$a)
levels(b2) <- c("b",rep("c",3),rep("a",4),"d")
b2 <- as.character(b2)
# a third strategy using `library(car)`
b3 <- car::recode(df$a,"0='b';1:3='c';10='d';else='a'")
# check that all strategies are the same
all.equal(a,as.character(b1))
# [1] TRUE
all.equal(as.character(b1),as.character(b2))
# [1] TRUE
all.equal(as.character(b1),as.character(b3))
# [1] TRUE
You might also consider mapvalues or revalue in plyr, particularly if you're dealing with more labels:
df$a <- plyr::mapvalues(df$a, c(0, 1, 2, 3, 10), c("b", "c", "c", "c", "d"))
df$a[! df$a %in% c("b", "c", "d")] <- "a" # The !pick values
Here is another fairly straightforward solution:
names(pick) <- c("b", "c", "c", "c", "d")
x <- names(pick[match(df$a, pick)])
x[is.na(x)] <- "a"
x
# [1] "b" "a" "c" "d" "c" "a" "c" "a" "d" "c" "a" "a" "c" "c"
It is even more straightforward if you include an NA in your "pick" object.
pick <- c(NA, 0, 1, 2, 3, 10)
names(pick) <- c("a", "b", "c", "c", "c", "d")
names(pick[match(df$a, pick, nomatch = 1)])
# [1] "b" "a" "c" "d" "c" "a" "c" "a" "d" "c" "a" "a" "c" "c"
If you use this second alternative, note that nomatch takes an integer: the position in "pick" to fall back on when a value has no match. Here, nomatch = 1 points at the NA, which is in the first position of the "pick" vector and carries the name "a". If the NA were in the last position, you would use nomatch = 6 instead.
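To make the positional meaning of nomatch concrete, here is the same lookup as a sketch with the NA moved to the end of "pick"; the variable name labels is chosen for this example.

```r
df <- data.frame(a = c(0, 7, 1, 10, 2, 4, 3, 5, 10, 1, 7, 8, 3, 2))

# NA is now the sixth element, so unmatched values must fall back to position 6
pick <- c(0, 1, 2, 3, 10, NA)
names(pick) <- c("b", "c", "c", "c", "d", "a")

labels <- names(pick[match(df$a, pick, nomatch = 6)])
labels
# [1] "b" "a" "c" "d" "c" "a" "c" "a" "d" "c" "a" "a" "c" "c"
```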
You can also use the ifelse function:
with(df,ifelse(a==0,"b",ifelse(a %in% c(1,2,3),"c",ifelse(a==10,"d","a"))))
# [1] "b" "a" "c" "d" "c" "a" "c" "a" "d" "c" "a" "a" "c" "c"
