Using purrr::map2 when one variable is not part of the function - r

If I had a function like this:
foo <- function(var) {
  if (length(var) > 5) stop("can't be greater than 5")
  data.frame(var = var)
}
Where this worked:
df <- 1:20
foo(var = df[1:5])
But this didn't:
foo(var = df)
The desired output is:
var
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
If I know that I can only run this function in chunks of 5 rows, what would be the best approach if I wanted to evaluate all 20 rows? Can I use purrr::map() for this? Assume that the 5-row constraint is rigid.
Thanks in advance.

Split df into chunks of 5 elements each, use purrr::map_dfr to apply foo to every chunk, and let map_dfr bind the results together by rows:
library(tidyverse)
foo <- function(var) {
  if (length(var) > 5) stop("can't be greater than 5")
  data.frame(var = var)
}
df <- 1:20
df_split <- split(df, (seq(length(df))-1) %/% 5)
df_split
map_dfr(df_split, ~ foo(.x))
var
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
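The same split, apply, and bind pattern can be sketched in base R, in case purrr is not available. This is only an illustration: chunk_size is a hypothetical name introduced here, and ceiling(seq_along(x) / chunk_size) is just another way of building the grouping index used above.

```r
# foo() as defined in the question: it rejects inputs longer than 5
foo <- function(var) {
  if (length(var) > 5) stop("can't be greater than 5")
  data.frame(var = var)
}

x <- 1:20
chunk_size <- 5  # hypothetical name for the rigid 5-row limit

# Elements 1-5 get index 1, elements 6-10 get index 2, and so on
chunks <- split(x, ceiling(seq_along(x) / chunk_size))

# Apply foo() to each chunk, then stack the resulting data frames by rows
res <- do.call(rbind, lapply(chunks, foo))
rownames(res) <- NULL
```

If the length of x is not a multiple of chunk_size, the last chunk is simply shorter, so foo() still never sees more than 5 elements.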

You can use dplyr::group_by or tapply:
data.frame(df) %>%
  mutate(grp = (row_number() - 1) %/% 5) %>%
  group_by(grp) %>%
  mutate(var = foo(df)$var) %>%
  ungroup() %>%
  select(var)
# # A tibble: 20 x 1
# var
# <int>
# 1 1
# 2 2
# 3 3
# 4 4
# 5 5
# 6 6
# 7 7
# 8 8
# 9 9
# 10 10
# 11 11
# 12 12
# 13 13
# 14 14
# 15 15
# 16 16
# 17 17
# 18 18
# 19 19
# 20 20
data.frame(var=unlist(tapply(df,(df-1) %/% 5,foo)))
# var
# 01 1
# 02 2
# 03 3
# 04 4
# 05 5
# 11 6
# 12 7
# 13 8
# 14 9
# 15 10
# 21 11
# 22 12
# 23 13
# 24 14
# 25 15
# 31 16
# 32 17
# 33 18
# 34 19
# 35 20

Related

Simulations of Poisson Distributions with different sample sizes and lambdas

Let's start with a statistical model in which X follows a Poisson distribution with parameter k - lambda, where k is a constant:
X ~ Pois(k - lambda)
Now assume that k = 20. How do I write a function, or use for loops, to run a simulation with different values of lambda <- c(2, 4, 8, 16), where each lambda is simulated at every sample size from n = 1 to n = 25?
n <- 1:10
k <- 20
lambda <- c(2, 4, 8, 16)
result <- rpois(n, k - lambda)
result
The output:
28 12 13 1 13 16 16 3 12 15
My code is clearly wrong, because it does not give the output I need: for every lambda value there should be one sample for each sample size, from n = 1 and n = 2 up to n = 25.
My idea is to use a double for loop: one loop over the sample sizes n and another over the lambda values. I'm not sure how to implement this.
The expected output should be something like this. For lambda = 8,
11
12,11
13,11,14
11,14,14,14
......
10 9 13 13 13 11 8 17 10 11 13 11 17 13 9 8 13 15 10 10 15 14 14 15 9
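rpois is not vectorized over its first argument: when n has length greater than 1, only its length is used as the number of draws, which is why the call above returned just 10 values. You can wrap rpois in Vectorize and pass it to outer, which then calls it once for every combination of sample size and rate.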
n <- 1:10
k <- 20
lambdas <- c(2, 4, 8, 16)
set.seed(42)
res <- outer(n, k - lambdas, Vectorize(rpois)) |> apply(2, as.list)
Gives
res |> setNames(paste0('lambda_', lambdas))
# $lambda_2
# $lambda_2[[1]]
# [1] 11
#
# $lambda_2[[2]]
# [1] 16 16
#
# $lambda_2[[3]]
# [1] 16 23 17
#
# $lambda_2[[4]]
# [1] 19 14 18 13
#
# $lambda_2[[5]]
# [1] 23 12 17 17 14
#
# $lambda_2[[6]]
# [1] 13 14 12 13 13 15
#
# $lambda_2[[7]]
# [1] 13 18 24 13 10 15 21
#
# $lambda_2[[8]]
# [1] 17 33 14 19 16 23 19 12
#
# $lambda_2[[9]]
# [1] 15 21 10 16 15 19 28 23 17
#
# $lambda_2[[10]]
# [1] 28 20 22 29 17 16 17 15 18 21
#
#
# $lambda_4
# $lambda_4[[1]]
# [1] 15
#
# $lambda_4[[2]]
# [1] 15 17
#
# $lambda_4[[3]]
# [1] 19 11 14
#
# $lambda_4[[4]]
# [1] 16 18 18 15
#
# $lambda_4[[5]]
# [1] 15 13 16 11 18
#
# $lambda_4[[6]]
# [1] 11 16 15 23 12 18
#
# $lambda_4[[7]]
# [1] 15 10 18 14 12 15 13
#
# $lambda_4[[8]]
# [1] 20 14 20 22 19 11 17 20
#
# $lambda_4[[9]]
# [1] 9 22 15 16 18 18 13 20 14
#
# $lambda_4[[10]]
# [1] 19 14 22 14 19 15 17 22 21 15
#
#
# $lambda_8
# $lambda_8[[1]]
# [1] 13
#
# $lambda_8[[2]]
# [1] 15 12
#
# $lambda_8[[3]]
# [1] 17 11 14
#
# $lambda_8[[4]]
# [1] 10 7 8 8
#
# $lambda_8[[5]]
# [1] 20 8 13 11 12
#
# $lambda_8[[6]]
# [1] 7 14 16 14 13 10
#
# $lambda_8[[7]]
# [1] 13 10 11 15 13 12 11
#
# $lambda_8[[8]]
# [1] 15 16 8 8 9 16 13 13
#
# $lambda_8[[9]]
# [1] 7 9 6 9 6 4 12 13 26
#
# $lambda_8[[10]]
# [1] 12 9 8 10 13 12 11 18 10 10
#
#
# $lambda_16
# $lambda_16[[1]]
# [1] 1
#
# $lambda_16[[2]]
# [1] 2 4
#
# $lambda_16[[3]]
# [1] 3 6 6
#
# $lambda_16[[4]]
# [1] 1 6 3 5
#
# $lambda_16[[5]]
# [1] 2 4 7 4 7
#
# $lambda_16[[6]]
# [1] 5 5 6 7 2 2
#
# $lambda_16[[7]]
# [1] 2 6 6 3 4 4 3
#
# $lambda_16[[8]]
# [1] 3 7 3 1 5 5 2 1
#
# $lambda_16[[9]]
# [1] 3 0 4 7 3 3 4 2 3
#
# $lambda_16[[10]]
# [1] 3 7 7 5 5 11 4 2 2 6

Fill zeros for missing values in R

I am trying to deal with this problem.
I have a df with a date column and I want to count the occurrences per hour. Here is what I've done:
x <- df %>%
mutate(hora = hour(date)) %>%
select(hora) %>%
count(hora)
that gives as a result:
> x
# A tibble: 19 x 2
hora n
<int> <int>
1 0 1
2 1 1
3 3 1
4 8 4
5 9 7
6 10 10
7 11 14
8 12 10
9 13 8
10 14 4
11 15 5
12 16 12
13 17 4
14 18 12
15 19 9
16 20 5
17 21 2
18 22 4
19 23 4
As you can see, there are hours that don't show up that would have n=0, like 2 or 4:7. What I want is it to add the hours that are not in x with n=0 so the table is complete.
The expected output should be something like this:
hora n
1 0 12
2 1 3
3 2 5
4 3 7
5 4 8
6 5 1
7 6 0
8 7 11
9 8 6
10 9 10
11 10 9
12 11 0
13 12 0
14 13 3
15 14 0
16 15 7
17 16 8
18 17 1
19 18 2
20 19 11
21 20 6
22 21 10
23 22 9
24 23 4
I tried creating a table with hours 0:23 and all n=0 and then summing the two tables, but that didn't work. I also tried x$hour <- 0:23, thinking that the missing values would be added, but that didn't work either.
You could convert hora to a factor with levels 0:23 and use .drop = FALSE in count, so that empty levels are kept with n = 0:
library(dplyr)
library(lubridate)
df %>%
mutate(hora = factor(hour(date), levels = 0:23)) %>%
count(hora, .drop = FALSE)
Another option is to use complete:
df %>%
mutate(hora = hour(date)) %>%
count(hora) %>%
tidyr::complete(hora = 0:23, fill = list(n = 0))
A solution in Base R merges a vector of hours with the summarized data, and sets the missing counts to 0.
textFile <- "row hour count
1 0 1
2 1 1
3 3 1
4 8 4
5 9 7
6 10 10
7 11 14
8 12 10
9 13 8
10 14 4
11 15 5
12 16 12
13 17 4
14 18 12
15 19 9
16 20 5
17 21 2
18 22 4
19 23 4"
data <- read.table(text = textFile,header = TRUE)[-1]
hours <- data.frame(hour = 0:23)
merged <- merge(data,hours,all.y = TRUE)
merged[is.na(merged$count),"count"] <- 0
...and the output:
> head(merged)
hour count
1 0 1
2 1 1
3 2 0
4 3 1
5 4 0
6 5 0
>
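The factor-levels idea also works in base R with table(), which reports a zero for every level that never occurs. A minimal sketch, assuming a hypothetical vector hora of hours already extracted from the timestamps:

```r
# Hypothetical sample of hours extracted from timestamps
hora <- c(0, 1, 3, 8, 8, 9, 23)

# Declaring all 24 levels up front forces table() to count absent hours as 0
counts <- table(factor(hora, levels = 0:23))

x <- data.frame(hora = 0:23, n = as.vector(counts))
head(x)
```

The levels argument is what carries the information about which hours should exist, so no join against a second table is needed.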

Split into groups based on (multiple) conditions?

I have a set of marbles of different colors and weights, and I want to split them into groups based on their weight and color.
The conditions are:
A group cannot weigh more than 100 units
A group cannot have more than 5 different-colored marbles.
A reproducible example:
marbles <- data.frame(color=sample(1:20, 20), weight=sample(1:40, 20, replace=T))
color weight
1 1 22
2 15 33
3 13 35
4 11 13
5 6 26
6 8 15
7 10 3
8 16 22
9 14 21
10 3 16
11 4 26
12 20 30
13 9 31
14 2 16
15 7 12
16 17 13
17 19 19
18 5 17
19 12 12
20 18 40
And what I want is this group column:
color weight group
1 1 22 1
2 15 33 1
3 13 35 1
4 11 13 2
5 6 26 2
6 8 15 2
7 10 3 2
8 16 22 2
9 14 21 3
10 3 16 3
11 4 26 3
12 20 30 3
13 9 31 4
14 2 16 4
15 7 12 4
16 17 13 4
17 19 19 4
18 5 17 5
19 12 12 5
20 18 40 5
TIA.
The approach below doesn't produce an optimal assignment to the groups; it just works sequentially through the data frame. It uses rowwise, so it is not vectorized and might not be the most efficient way.
library(dplyr)
marbles <- data.frame(color=sample(1:20, 20), weight=sample(1:40, 20, replace=T))
Below I create a rowwise function which we can apply using dplyr
assign_group <- function(color, weight) {
  # Candidate state if this marble joins the current group
  clists <- append(color_list, color)
  sum_val <- group_sum + weight
  # Count colors including the new marble, so a group never exceeds 5 colors
  num_colors <- length(unique(clists))
  assign_condition <- (sum_val <= 100 & num_colors <= 5)
  # Update the global trackers: extend the current group or start a new one
  cval <- if (assign_condition) clists else c(color)
  sval <- ifelse(assign_condition, sum_val, weight)
  gval <- ifelse(assign_condition, group_number, group_number + 1)
  assign("color_list", cval, envir = .GlobalEnv)
  assign("group_sum", sval, envir = .GlobalEnv)
  assign("group_number", gval, envir = .GlobalEnv)
  # Return the group this marble was assigned to
  gval
}
I then setup a few global variables to track the allocation of the marbles to each group.
# globals
color_list <<- c()
group_sum <<- 0
group_number <<- 1
Finally run this function using mutate
test <- marbles %>% rowwise() %>% mutate(group = assign_group(color,weight)) %>% data.frame()
Which results in the below
color weight group
1 6 27 1
2 12 16 1
3 15 32 1
4 20 25 1
5 19 5 2
6 2 21 2
7 16 39 2
8 17 4 2
9 11 16 2
10 7 7 3
11 10 5 3
12 1 30 3
13 13 7 3
14 9 39 3
15 14 7 4
16 8 17 4
17 18 9 4
18 4 36 4
19 3 1 4
20 5 3 5
And seems to meet the constraints
test %>% group_by(group) %>% summarise(tot_w = sum(weight), n_c = length(unique(color)) )
group tot_w n_c
<dbl> <int> <int>
1 1 100 4
2 2 85 5
3 3 88 5
4 4 70 5
5 5 3 1
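The same sequential assignment can be sketched as a plain loop that keeps its state in local variables instead of globals. The names assign_groups, max_weight, and max_colors are hypothetical, and the sketch assumes that no single marble weighs more than max_weight. The data are the example values from the question.

```r
assign_groups <- function(color, weight, max_weight = 100, max_colors = 5) {
  group <- integer(length(weight))
  current <- 1L   # group currently being filled
  total <- 0      # weight already in the current group
  seen <- c()     # distinct colors already in the current group
  for (i in seq_along(weight)) {
    new_seen <- union(seen, color[i])
    # Start a new group if adding this marble would break either limit
    if (total + weight[i] > max_weight || length(new_seen) > max_colors) {
      current <- current + 1L
      total <- 0
      new_seen <- color[i]
    }
    total <- total + weight[i]
    seen <- new_seen
    group[i] <- current
  }
  group
}

marbles <- data.frame(
  color  = c(1, 15, 13, 11, 6, 8, 10, 16, 14, 3, 4, 20, 9, 2, 7, 17, 19, 5, 12, 18),
  weight = c(22, 33, 35, 13, 26, 15, 3, 22, 21, 16, 26, 30, 31, 16, 12, 13, 19, 17, 12, 40)
)
marbles$group <- assign_groups(marbles$color, marbles$weight)
```

Like the rowwise version, this is a greedy pass in row order, so it is not guaranteed to use the fewest possible groups.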
In base R you could write a recursive function as shown below:
create_group = function(df, a) {
  if (missing(a)) a = cumsum(df$weight) %/% 100
  b = !ave(df$color, a, FUN = seq_along) %% 6
  d = ave(df$weight, a + b, FUN = cumsum) > 100
  a = a + b + d
  if (any(b | d)) create_group(df, a) else cbind(df, group = a + 1)
}
create_group(marbles)
color weight group
1 1 22 1
2 15 33 1
3 13 35 1
4 11 13 2
5 6 26 2
6 8 15 2
7 10 3 2
8 16 22 2
9 14 21 3
10 3 16 3
11 4 26 3
12 20 30 3
13 9 31 4
14 2 16 4
15 7 12 4
16 17 13 4
17 19 19 4
18 5 17 5
19 12 12 5
20 18 40 5

creating new tibble columns based on mapping plus user data

I am trying to generate new columns in a tibble from the output of a function that takes as input several existing columns of that tibble plus user data. As a simplified example, I would want to use this function
addup <- function(x, y, z){x + y + z}
and use it to add the numbers in the existing columns in this tibble...
set.seed(1)
(tib <- tibble(num1 = sample(12), num2 = sample(12)))
# A tibble: 12 x 2
num1 num2
<int> <int>
1 8 5
2 6 3
3 7 7
4 3 11
5 1 2
6 2 1
7 11 6
8 10 9
9 4 8
10 9 12
11 5 10
12 12 4
...together with user input. For instance, if a user defines the vector
vec <- c(3,6,4)
I would like to generate one new column per item in vec, adding the mapped values with the user input values.
The desired result in this case would look something like:
# A tibble: 12 x 5
num1 num2 `3` `6` `4`
<int> <int> <dbl> <dbl> <dbl>
1 5 7 15 18 16
2 8 2 13 16 14
3 7 9 19 22 20
4 1 11 15 18 16
5 3 3 9 12 10
6 9 12 24 27 25
7 6 6 15 18 16
8 10 10 23 26 24
9 11 4 18 21 19
10 12 5 20 23 21
11 4 1 8 11 9
12 2 8 13 16 14
If I know vec beforehand, I could achieve this by
tib %>%
mutate("3" = map2_dbl(num1, num2, ~addup(.x, .y, 3)),
"6" = map2_dbl(num1, num2, ~addup(.x, .y, 6)),
"4" = map2_dbl(num1, num2, ~addup(.x, .y, 4)))
but as the length of vec can vary, I do not know how to generalize this. I found the answer to "repeated mutate in tidyverse", but there the functions are repeated over the existing columns rather than mapping over several existing columns at once.
Any ideas?
Since we don't have to have the function or the colnames as arguments, this is relatively simple. You just need to iterate over vec with a function that returns the summed column, and then combine with the original table. If you have an addup function that accepts vector inputs then you can skip the whole map2 part; in fact this one does but I don't know if your real function does.
library(tidyverse)
vec <- c(3,6,4)
set.seed(1)
tib <- tibble(num1 = sample(12), num2 = sample(12))
addup <- function(c1, c2, z) {c1 + c2 + z}
addup_vec <- function(df, vec) {
  new_cols <- map_dfc(
    .x = vec,
    .f = function(v) {
      map2_dbl(
        .x = df[["num1"]],
        .y = df[["num2"]],
        .f = ~ addup(.x, .y, v)
      )
    }
  )
  colnames(new_cols) <- vec
  bind_cols(df, new_cols)
}
tib %>%
addup_vec(vec)
#> # A tibble: 12 x 5
#> num1 num2 `3` `6` `4`
#> <int> <int> <dbl> <dbl> <dbl>
#> 1 4 9 16 19 17
#> 2 5 5 13 16 14
#> 3 6 8 17 20 18
#> 4 9 11 23 26 24
#> 5 2 6 11 14 12
#> 6 7 7 17 20 18
#> 7 10 3 16 19 17
#> 8 12 4 19 22 20
#> 9 3 12 18 21 19
#> 10 1 1 5 8 6
#> 11 11 2 16 19 17
#> 12 8 10 21 24 22
Created on 2019-01-16 by the reprex package (v0.2.0).
This uses lapply to apply the function to each element of your vector then binds the result to the original data frame and adds column names.
# Given example
set.seed(1)
(tib <- tibble(num1 = sample(12), num2 = sample(12)))
addup <- function(x, y, z){x + y + z}
vec <- c(3,6,4)
# Add columns and bind to original data frame
foo <- cbind(tib, lapply(vec, function(x)addup(tib$num1, tib$num2, x)))
# Correct column names
colnames(foo)[(ncol(tib)+1):ncol(foo)] <- vec
# Print result
print(foo)
# num1 num2 3 6 4
# 1 4 9 16 19 17
# 2 5 5 13 16 14
# 3 6 8 17 20 18
# 4 9 11 23 26 24
# 5 2 6 11 14 12
# 6 7 7 17 20 18
# 7 10 3 16 19 17
# 8 12 4 19 22 20
# 9 3 12 18 21 19
# 10 1 1 5 8 6
# 11 11 2 16 19 17
# 12 8 10 21 24 22
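When addup accepts whole vectors, as it does here, the new columns can also be created and named in a single assignment. This is a sketch under that assumption, using a plain data.frame so that it runs without the tidyverse:

```r
addup <- function(x, y, z) x + y + z

set.seed(1)
tib <- data.frame(num1 = sample(12), num2 = sample(12))
vec <- c(3, 6, 4)

# One new column per element of vec, named after that element
tib[as.character(vec)] <- lapply(vec, function(v) addup(tib$num1, tib$num2, v))

names(tib)
```

Because the column names are taken from vec itself, this works for a vec of any length, provided its values are unique.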

r - use dplyr::group_by in combination with purrr::pmap

I have the following dataframe:
df <- data.frame(a = c(1:20),
b = c(2:21),
c = as.factor(c(rep(1,5), rep(2,10), rep(3,5))))
and I want to do the following:
df1 <- df %>% group_by(c) %>% mutate(a = lead(b))
but originally I have many variables to which I need to apply the lead() function in combination with group_by() on multiple variables. I'm trying the purrr::pmap() to achieve this:
df2 <- pmap(list(df[,1],df[,2],df[,3]), function(x,y,z) group_by(z) %>% lead(y))
Unfortunately this results in error:
Error in UseMethod("group_by_") :
no applicable method for 'group_by_' applied to an object of class "c('integer', 'numeric')"
You can do this with mutate_at and named arguments to funs(), which creates new columns instead of overwriting them. Note that this does nothing to a but you can rename the columns after this as desired.
df <- data.frame(
a = c(1:20),
b = c(2:21),
b2 = 3:22,
b3 = 4:23,
c = as.factor(c(rep(1, 5), rep(2, 10), rep(3, 5)))
)
library(tidyverse)
df %>%
group_by(c) %>%
mutate_at(vars(starts_with("b")), funs(lead = lead(.)))
#> # A tibble: 20 x 8
#> # Groups: c [3]
#> a b b2 b3 c b_lead b2_lead b3_lead
#> <int> <int> <int> <int> <fct> <int> <int> <int>
#> 1 1 2 3 4 1 3 4 5
#> 2 2 3 4 5 1 4 5 6
#> 3 3 4 5 6 1 5 6 7
#> 4 4 5 6 7 1 6 7 8
#> 5 5 6 7 8 1 NA NA NA
#> 6 6 7 8 9 2 8 9 10
#> 7 7 8 9 10 2 9 10 11
#> 8 8 9 10 11 2 10 11 12
#> 9 9 10 11 12 2 11 12 13
#> 10 10 11 12 13 2 12 13 14
#> 11 11 12 13 14 2 13 14 15
#> 12 12 13 14 15 2 14 15 16
#> 13 13 14 15 16 2 15 16 17
#> 14 14 15 16 17 2 16 17 18
#> 15 15 16 17 18 2 NA NA NA
#> 16 16 17 18 19 3 18 19 20
#> 17 17 18 19 20 3 19 20 21
#> 18 18 19 20 21 3 20 21 22
#> 19 19 20 21 22 3 21 22 23
#> 20 20 21 22 23 3 NA NA NA
Created on 2018-09-07 by the reprex package (v0.2.0).
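funs() has since been deprecated in dplyr, and across() (available from dplyr 1.0.0) covers the same use case. A sketch of the equivalent call, assuming a recent dplyr; res is a hypothetical name and the data frame is a trimmed version of the one above:

```r
library(dplyr)

df <- data.frame(
  a  = 1:20,
  b  = 2:21,
  b2 = 3:22,
  c  = factor(rep(1:3, times = c(5, 10, 5)))
)

res <- df %>%
  group_by(c) %>%
  # Apply lead() to every column starting with "b", within each group of c,
  # and write the result to new columns named <column>_lead
  mutate(across(starts_with("b"), ~ lead(.x), .names = "{.col}_lead")) %>%
  ungroup()
```

The .names argument plays the role of the named funs() argument: it keeps the original columns and adds b_lead and b2_lead next to them.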
