Create all possible combinations from two values for each element in a vector in R [duplicate] - r

This question already has answers here:
How to generate a matrix of combinations
(3 answers)
Closed 6 years ago.
I have been trying to create vectors where each element can take two different values present in two different vectors.
For example, if there are two vectors a and b, where a is c(6,2,9) and b is c(12,5,15) then the output should be 8 vectors given as follows,
6 2 9
6 2 15
6 5 9
6 5 15
12 2 9
12 2 15
12 5 9
12 5 15
The following piece of code works,
aa1 <- c(6,12)
aa2 <- c(2,5)
aa3 <- c(9,15)
for(a1 in 1:2)
for(a2 in 1:2)
for(a3 in 1:2)
{
v <- c(aa1[a1],aa2[a2],aa3[a3])
print(v)
}
But I was wondering if there was a simpler way to do this instead of writing several for loops which will also increase linearly with the number of elements the final vector will have.

expand.grid is a function that makes all combinations of whatever vectors you pass it, but in this case you need to rearrange your vectors so you have a pair of first elements, second elements, and third elements so the ultimate call is:
expand.grid(c(6, 12), c(2, 5), c(9, 15))
A quick way to rearrange the vectors in base R is Map, the multivariate version of lapply, with c() as the function:
a <- c(6, 2, 9)
b <- c(12, 5, 15)
Map(c, a, b)
## [[1]]
## [1] 6 12
##
## [[2]]
## [1] 2 5
##
## [[3]]
## [1] 9 15
Conveniently expand.grid is happy with either individual vectors or a list of vectors, so we can just call:
expand.grid(Map(c, a, b))
## Var1 Var2 Var3
## 1 6 2 9
## 2 12 2 9
## 3 6 5 9
## 4 12 5 9
## 5 6 2 15
## 6 12 2 15
## 7 6 5 15
## 8 12 5 15
If Map is confusing you, if you put a and b in a list, purrr::transpose will do the same thing, flipping from a list of two elements of length three to a list of three elements of length two:
library(purrr)
list(a, b) %>% transpose() %>% expand.grid()
and return the same thing.

I think what you're looking for is expand.grid.
a <- c(6,2,9)
b <- c(12,5,15)
expand.grid(a,b)
Var1 Var2
1 6 12
2 2 12
3 9 12
4 6 5
5 2 5
6 9 5
7 6 15
8 2 15
9 9 15

Related

Vector recycling concept in R

I am trying to understand the working of vector recycling in R. I have 2 vectors
c(2,4,6)
and
c(1,2)
And I want to use the rep() to produce an output as follows:
[1] 2 4 6 4 8 12
based on what I understand from ?rep() is that there are times and each parameters which do the operations which I tried.
> rep(c(2,4,6), times=2)
[1] 2 4 6 2 4 6
But I also see the first vector is multiplied by the first element of the second vector and then to the second element of the second vector. Not sure how to proceed with it.
You can use:
rep(c(2,4,6), 2) * rep(c(1,2), each=3)
#[1] 2 4 6 4 8 12
or with auto recycling:
c(2,4,6) * rep(c(1,2), each=3)
#[1] 2 4 6 4 8 12
Alternative outer could be used:
c(outer(c(2,4,6), c(1,2)))
#[1] 2 4 6 4 8 12
Also crossprod could be used:
c(crossprod(t(c(2,4,6)), c(1,2)))
#[1] 2 4 6 4 8 12
Or %*%:
c(c(2,4,6) %*% t(c(1,2)))
#[1] 2 4 6 4 8 12

How to implement extract/separate functions (from dplyr and tidyr) to separate a column into multiple columns. based on arbitrary values?

I have a column:
Y = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
I would like to split into multiple columns, based on the positions of the column values. For instance, I would like:
Y1=c(1,2,3,4,5)
Y2=c(6,7,8,9,10)
Y3=c(11,12,13,14,15)
Y4=c(16,17,18,19,20)
Since I am working with a big data time series set, the divisions will be arbitrary depending on the length of one time period.
You can use the base split to split this vector into vectors that are each 5 items long. You could also use a variable to store this interval length.
Using rep with each = 5, and creating a sequence programmatically, gets you a sequence of the numbers 1, 2, ... up to the length divided by 5 (in this case, 4), each 5 times consecutively. Then split returns a list of vectors.
It's worth noting that a variety of SO posts will recommend you store similar data in lists such as this, rather than creating multiple variables, so I'm leaving it in list form here.
Y <- 1:20
breaks <- rep(1:(length(Y) / 5), each = 5)
split(Y, breaks)
#> $`1`
#> [1] 1 2 3 4 5
#>
#> $`2`
#> [1] 6 7 8 9 10
#>
#> $`3`
#> [1] 11 12 13 14 15
#>
#> $`4`
#> [1] 16 17 18 19 20
Created on 2019-02-12 by the reprex package (v0.2.1)
Not a dplyr solution, but I believe the easiest way would involve using matrices.
foo = function(data, sep.in=5) {
data.matrix = matrix(data,ncol=5)
data.df = as.data.frame(data.matrix)
return(data.df)
}
I have not tested it but this function should create a data.frame who can be merge to a existing one using cbind()
We can make use of split (writing the commented code as solution) to split the vector into a list of vectors.
lst <- split(Y, as.integer(gl(length(Y), 5, length(Y))))
lst
#$`1`
#[1] 1 2 3 4 5
#$`2`
#[1] 6 7 8 9 10
#$`3`
#[1] 11 12 13 14 15
#$`4`
#[1] 16 17 18 19 20
Here, the gl create a grouping index by specifying the n, k and length parameters where n - an integer giving the number of levels, k - an integer giving the number of replications, and length -an integer giving the length of the result.
In our case, we want to have 'k' as 5.
as.integer(gl(length(Y), 5, length(Y)))
#[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
If we want to have multiple objects in the global environment, use list2env
list2env(setNames(lst, paste0("Y", seq_along(lst))), envir = .GlobalEnv)
Y1
#[1] 1 2 3 4 5
Y2
#[1] 6 7 8 9 10
Y3
#[1] 11 12 13 14 15
Y4
#[1] 16 17 18 19 20
Or as the OP mentioned dplyr/tidyr in the question, we can use those packages as well
library(tidyverse)
tibble(Y) %>%
group_by(grp = (row_number()-1) %/% 5 + 1) %>%
summarise(Y = list(Y)) %>%
pull(Y)
#[[1]]
#[1] 1 2 3 4 5
#[[2]]
#[1] 6 7 8 9 10
#[[3]]
#[1] 11 12 13 14 15
#[[4]]
#[1] 16 17 18 19 20
data
Y <- 1:20

Generate sequence between each element of 2 vectors

I have a for loop that generate each time 2 vectors of the same length (length can vary for each iteration) such as:
>aa
[1] 3 5
>bb
[1] 4 8
I want to create a sequence using each element of these vectors to obtain that:
>zz
[1] 3 4 5 6 7 8
Is there a function in R to create that?
We can use Mapto get the sequence of corresponding elements of 'aa' , 'bb'. The output is a list, so we unlist to get a vector.
unlist(Map(`:`, aa, bb))
#[1] 3 4 5 6 7 8
data
aa <- c(3,5)
bb <- c(4, 8)
One can obtain a sequence by using the colon operator : that separates the beginning of a sequence from its end. We can define such sequences for each vector, aa and bb, and concatenate the results with c() into a single series of numbers.
To avoid double entries in overlapping ranges we can use the unique() function:
zz <- unique(c(aa[1]:aa[length(aa)],bb[1]:bb[length(bb)]))
#> zz
#[1] 3 4 5 6 7 8
with
aa <- c(3,5)
bb <- c(4,8)
Depending on your desired output, here are a few more alternatives:
> do.call("seq",as.list(range(aa,bb)))
[1] 3 4 5 6 7 8
> Reduce(seq,range(aa,bb)) #all credit due to #BrodieG
[1] 3 4 5 6 7 8
> min(aa,bb):max(aa,bb)
[1] 3 4 5 6 7 8

Subset columns using logical vector

I have a dataframe that I want to drop those columns with NA's rate > 70% or there is dominant value taking over 99% of rows. How can I do that in R?
I find it easier to select rows with logic vector in subset function, but how can I do the similar for columns? For example, if I write:
isNARateLt70 <- function(column) {//some code}
apply(dataframe, 2, isNARateLt70)
Then how can I continue to use this vector to subset dataframe?
If you have a data.frame like
dd <- data.frame(matrix(rpois(7*4,10),ncol=7, dimnames=list(NULL,letters[1:7])))
# a b c d e f g
# 1 11 2 5 9 7 6 10
# 2 10 5 11 13 11 11 8
# 3 14 8 6 16 9 11 9
# 4 11 8 12 8 11 6 10
You can subset with a logical vector using one of
mycols<-c(T,F,F,T,F,F,T)
dd[mycols]
dd[, mycols]
There's really no need to write a function when we have colMeans (thanks #MrFlick for the advice to change from colSums()/nrow(), and shown at the bottom of this answer).
Here's how I would approach your function if you want to use sapply on it later.
> d <- data.frame(x = rep(NA, 5), y = c(1, NA, NA, 1, 1),
z = c(rep(NA, 3), 1, 2))
> isNARateLt70 <- function(x) mean(is.na(x)) <= 0.7
> sapply(d, isNARateLt70)
# x y z
# FALSE TRUE TRUE
Then, to subset with the above line your data using the above line of code, it's
> d[sapply(d, isNARateLt70)]
But as mentioned, colMeans works just the same,
> d[colMeans(is.na(d)) <= 0.7]
# y z
# 1 1 NA
# 2 NA NA
# 3 NA NA
# 4 1 1
# 5 1 2
Maybe this will help too. The 2 parameter in apply() means apply this function column wise on the data.frame cars.
> columns <- apply(cars, 2, function(x) {mean(x) > 10})
> columns
speed dist
TRUE TRUE
> cars[1:10, columns]
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
7 10 18
8 10 26
9 10 34
10 11 17

Excel OFFSET function in r

I am trying to simulate the OFFSET function from Excel. I understand that this can be done for a single value but I would like to return a range. I'd like to return a group of values with an offset of 1 and a group size of 2. For example, on row 4, I would like to have a group with values of column a, rows 3 & 2. Sorry but I am stumped.
Is it possible to add this result to the data frame as another column using cbind or similar? Alternatively, could I use this in a vectorized function so I could sum or mean the result?
Mockup Example:
> df <- data.frame(a=1:10)
> df
a
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
> #PROCESS
> df
a b
1 1 NA
2 2 (1)
3 3 (1,2)
4 4 (2,3)
5 5 (3,4)
6 6 (4,5)
7 7 (5,6)
8 8 (6,7)
9 9 (7,8)
10 10 (8,9)
This should do the trick:
df$b1 <- c(rep(NA, 1), head(df$a, -1))
df$b2 <- c(rep(NA, 2), head(df$a, -2))
Note that the result will have to live in two columns, as columns in data frames only support simple data types. (Unless you want to resort to complex numbers.) head with a negative argument cuts the negated value of the argument from the tail, try head(1:10, -2). rep is repetition, c is concatenation. The <- assignment adds a new column if it's not there yet.
What Excel calls OFFSET is sometimes also referred to as lag.
EDIT: Following Greg Snow's comment, here's a version that's more elegant, but also more difficult to understand:
df <- cbind(df, as.data.frame((embed(c(NA, NA, df$a), 3))[,c(3,2)]))
Try it component by component to see how it works.
Do you want something like this?
> df <- data.frame(a=1:10)
> b=t(sapply(1:10, function(i) c(df$a[(i+2)%%10+1], df$a[(i+4)%%10+1])))
> s = sapply(1:10, function(i) sum(b[i,]))
> df = data.frame(df, b, s)
> df
a X1 X2 s
1 1 4 6 10
2 2 5 7 12
3 3 6 8 14
4 4 7 9 16
5 5 8 10 18
6 6 9 1 10
7 7 10 2 12
8 8 1 3 4
9 9 2 4 6
10 10 3 5 8

Resources