Get sum / mean in purrr over data.frame subset - r

I would like to find the sum across occurrences and then the mean of those sums across simulations in the following:
library(tidyverse)
set.seed(123)
s <- 2
data <- data.frame(
lamda = c(5, 2, 3),
meanlog = c(9, 10, 11),
sdlog = c(2, 2.1, 2.2))
data2 <- data %>%
mutate(freq = map(lamda, ~rpois(s, .x)),
freqsev = map(freq, ~map(.x, function(k) rlnorm(k, meanlog, sdlog))))
I would like to take the sum of freqsev, then the mean of those sums over the simulation (s) dimension:
data3 <- data2 %>%
mutate(sum_freqsev = ???,
mean_sum_freqsev = ???)
Any ideas on how this can be achieved? Thank you!
Dimensions expected:
data2 is a data.frame with 3 rows (i.e. one per lamda).
sum_freqsev should be a list of <dbl [2]>, i.e. the sums of the entries in freqsev, one per simulation.
mean_sum_freqsev should be a number, simply the mean of sum_freqsev per lamda.

We can use a nested map to find sum_freqsev and a single map to find mean_sum_freqsev:
library(tidyverse)
data3 <- data2 %>%
mutate(sum_freqsev = freqsev %>% map(~map_dbl(., sum)),
mean_sum_freqsev = sum_freqsev %>% map_dbl(mean),
percentile = freqsev %>% map(~map(., ~quantile(.x, c(.50, .90)))))
The inner map_dbl sums the entries of freqsev over each simulation and returns a vector of type double instead of a list with two elements.
mean_sum_freqsev is calculated by taking the mean of each list element (a vector) of sum_freqsev and returning a double.
Output:
> as_tibble(data3)
# A tibble: 3 x 8
lamda meanlog sdlog freq freqsev sum_freqsev mean_sum_freqsev percentile
<dbl> <dbl> <dbl> <list> <list> <list> <dbl> <list>
1 5 9 2 <int [2]> <list [2]> <dbl [2]> 1493880. <list [2]>
2 2 10 2.1 <int [2]> <list [2]> <dbl [2]> 623586. <list [2]>
3 3 11 2.2 <int [2]> <list [2]> <dbl [2]> 15219. <list [2]>
> data3 %>% pull(percentile)
[[1]]
[[1]][[1]]
50% 90%
24633.8 1832533.5
[[1]][[2]]
50% 90%
22461.18 114075.74
[[2]]
[[2]][[1]]
50% 90%
470808.0 845321.7
[[2]][[2]]
50% 90%
12539.82 202665.48
[[3]]
[[3]][[1]]
50% 90%
3906.931 10100.830
[[3]][[2]]
50% 90%
NA NA
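To make the nested map easier to follow, here is a minimal sketch on a small hand-made list instead of random draws. The toy values in freqsev are assumptions chosen for illustration; only the map / map_dbl pattern comes from the answer above. The second simulation of the second row is deliberately empty, which is the kind of case that gives NA percentiles in the output above while the sum is simply 0.

```r
library(purrr)

# Toy stand-in for the freqsev list column: 2 rows, 2 simulations each
freqsev <- list(
  list(c(1, 2, 3), c(10, 20)),  # row 1
  list(c(5), numeric(0))        # row 2: second simulation has zero occurrences
)

# Outer map walks the rows, inner map_dbl sums each simulation
sum_freqsev <- map(freqsev, ~ map_dbl(.x, sum))

# One mean per row, taken over the simulation dimension
mean_sum_freqsev <- map_dbl(sum_freqsev, mean)

sum_freqsev[[1]]   # 6 30
sum_freqsev[[2]]   # 5 0
mean_sum_freqsev   # 18.0 2.5
```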


Find size of intersection of two list-columns in sparklyr

I am working with a tbl_spark in sparklyr.
I have a spark Dataframe with two list-type columns, and I would like to output two things:
The intersection of both lists (as a list)
The number of elements in the intersection
My input data looks something like the following (using the mtcars dataset) where "sc" is my spark connection:
library(dplyr)
library(sparklyr)
## Load mtcars into spark with connection "sc"
mtcars_spark <- copy_to(sc, mtcars)
## Wrangle mtcars to get list columns using ft_regex_tokenizer()
tbl_with_lists <- mtcars_spark %>%
mutate(mpg_rounded = round(mpg, -1)) %>%
group_by(mpg_rounded) %>%
summarize(
cyl_all = paste(collect_set(as.character(cyl)), sep = ", "),
gear_all = paste(collect_set(as.character(gear)), sep = ", ")
) %>%
ungroup() %>%
ft_regex_tokenizer("cyl_all", "cyl_list", pattern = "[,]\\s*") %>%
ft_regex_tokenizer("gear_all", "gear_list", pattern = "[,]\\s*")
tbl_with_lists
## # Source: spark<?> [?? x 5]
## mpg_rounded cyl_all gear_all cyl_list gear_list
## <dbl> <chr> <chr> <list> <list>
## 1 10 8.0 3.0 <list [1]> <list [1]>
## 2 30 4.0 5.0, 4.0 <list [1]> <list [2]>
## 3 20 8.0, 6.0, 4.0 5.0, 3.0, 4.0 <list [3]> <list [3]>
I haven't had much success with finding out how to do this. Any ideas?
I have found what might be a workaround using explode().
It would be great if there were a more direct way, though; I am not sure how well this solution will scale to larger datasets.
tbl_with_lists %>%
## First explode the lists to create new rows for each unique list value
mutate(
cyl_explode = explode(cyl_list)
) %>%
mutate(
gear_explode = explode(gear_list)
) %>%
## Summarize to count number of matches - this gives the size of the intersection of the two lists
group_by(mpg_rounded, cyl_list, gear_list) %>%
summarize(size_of_intersection = sum(as.integer(cyl_explode == gear_explode)))
## # Source: spark<?> [?? x 4]
## # Groups: mpg_rounded, cyl_list
## mpg_rounded cyl_list gear_list size_of_intersection
## <dbl> <list> <list> <dbl>
## 1 10 <list [1]> <list [1]> 0
## 2 30 <list [1]> <list [2]> 1
## 3 20 <list [3]> <list [3]> 1
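The counting logic of the explode() workaround can be checked locally on ordinary R lists. This is only an in-memory analogue, not Spark code: the two lists below are typed in by hand from the printed cyl_all and gear_all values, and outer() stands in for the row-by-row comparison that the double explode produces.

```r
library(purrr)

# Hand-copied from the cyl_all / gear_all columns printed above (assumption: same row order)
cyl_list  <- list("8.0", "4.0", c("8.0", "6.0", "4.0"))
gear_list <- list("3.0", c("5.0", "4.0"), c("5.0", "3.0", "4.0"))

# Every cyl value is compared with every gear value; matches are counted
size_of_intersection <- map2_int(cyl_list, gear_list, ~ sum(outer(.x, .y, "==")))

# The intersection itself, kept as a list
intersection <- map2(cyl_list, gear_list, intersect)

size_of_intersection   # 0 1 1
lengths(intersection)  # 0 1 1
```

Counting matches this way equals the intersection size only when neither list contains duplicates, which holds here because the lists were built with collect_set().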

Sum Product of a list in dataframe by Order id in R

I am working on a project similar to Uber Eats. I want to add a new column to a data frame that holds the sub-total of each order, but because each column is of class "list", R will not let me do that. Do you know of any way to do it?
Thank you
a = c(1,2,3)
b = 1:2
c = c(3,1)
P1 = c(12,13,4)
P2 = c(2,4)
P3 = c(12,1)
#My given dataframe will be:
Order | Price | Sub-total
a | P1 | sum(a*P1)
b | P2 | sum(b*P2)
c | P3 | sum(c*P3)
Expected output:
Subtotal = [50, 10, 37]
My goal is to compute a*P1, b*P2, c*P3 and then the total sum of a*P1, ...
library(tidyverse)
orders <- list(
a = c(1,2,3),
b = 1:2,
c = c(3,1)
)
prices <- list(
P1 = c(12,13,4),
P2 = c(2,4),
P3 = c(12,1)
)
tibble(
Order = orders,
Price = prices
) %>%
mutate(
sub_total = Order %>% map2_dbl(Price, ~ sum(.x * .y))
)
#> # A tibble: 3 x 3
#> Order Price sub_total
#> <named list> <named list> <dbl>
#> 1 <dbl [3]> <dbl [3]> 50
#> 2 <int [2]> <dbl [2]> 10
#> 3 <dbl [2]> <dbl [2]> 37
Created on 2021-10-01 by the reprex package (v2.0.1)
First, store your respective Order and Price data in lists:
a = c(1,2,3)
b = 1:2
c = c(3,1)
P1 = c(12,13,4)
P2 = c(2,4)
P3 = c(12,1)
Order <- list(a, b, c)
Price <- list(P1, P2, P3)
Use a tibble so that you can easily set list columns.
Then using the tidyverse structure, map over the two list columns and apply your formula.
library(dplyr)
library(purrr)
df <- tibble(Order = Order, Price = Price)
df <- df %>%
mutate(Sub_total = map2_dbl(Order, Price, ~ sum( .x * .y)))
The result will be as you expected. You can see your original data stored as lists and then your sub-totals.
> df
# A tibble: 3 x 3
Order Price Sub_total
<list> <list> <dbl>
1 <dbl [3]> <dbl [3]> 50
2 <int [2]> <dbl [2]> 10
3 <dbl [2]> <dbl [2]> 37
The total sum would then be sum(df$Sub_total) which is 97.
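One thing to watch with sum(.x * .y) is that R silently recycles the shorter vector when an order and its price vector differ in length. A minimal sketch of a guarded version follows; the helper name safe_subtotal is hypothetical and not part of the answers here.

```r
library(purrr)

# Hypothetical helper: refuse to multiply vectors of different lengths
safe_subtotal <- function(qty, price) {
  stopifnot(length(qty) == length(price))
  sum(qty * price)
}

Order <- list(c(1, 2, 3), 1:2, c(3, 1))
Price <- list(c(12, 13, 4), c(2, 4), c(12, 1))

sub_total <- map2_dbl(Order, Price, safe_subtotal)
sub_total       # 50 10 37
sum(sub_total)  # 97
```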
Here is an option in base R
d1 <- data.frame(Order = I(list(a, b, c)), Price = I(list(P1, P2, P3)))
d1$Sub_total <- unlist(Map(`%*%`, d1$Order, d1$Price))
-output
> d1
Order Price Sub_total
1 1, 2, 3 12, 13, 4 50
2 1, 2 2, 4 10
3 3, 1 12, 1 37

preserve column-of-tibbles when using grouped summarise with reduce

Question
When using tibble %>% group_by() %>% summarise(...=reduce(...)) on a column containing tibbles, I would like the output to remain a column of tibbles. How do I do that most efficiently?
Minimal Example:
Setup
library(tidyverse)
vec1 = rnorm(10)
vec2 = rnorm(10)
vec3 = rnorm(10)
vec4 = rnorm(10)
# col2 is a list column holding one length-10 vector per row
tib = tibble(grpvar = factor(c('a','a','b','b')),
col2 = list(vec1, vec2, vec3, vec4))
This is what it looks like:
grpvar col2
<fct> <list>
1 a <dbl [10]>
2 a <dbl [10]>
3 b <dbl [10]>
4 b <dbl [10]>
A very minimal tibble with a variable that will be used for grouping, and another column containing tibbles which contain vectors of length 10.
Problem
Using reduce within summarise simplifies the output...
tib %>% group_by(grpvar) %>% summarise(aggr=reduce(col2,`+`))
yields:
grpvar aggr
<fct> <dbl>
1 a -0.0206
...
10 a -0.101
...
20 b 0.520
Here, the tibble becomes very long ... I don't want 10 rows per group variable, but instead just one tibble containing the 10 values.
Desired output:
This is what it should look like
desired_output<-tibble(grpvar=c('a','b'),aggr=NA)
desired_output$aggr[1]=tibble(reduce(tib$col2[1:2],`+`))
desired_output$aggr[2]=tibble(reduce(tib$col2[3:4],`+`))
which looks like:
# A tibble: 2 x 2
grpvar aggr
<chr> <list>
1 a <dbl [10]>
2 b <dbl [10]>
i.e., it retains the column-of-tibbles structure (which internally, I believe is a list of vectors)
Wrap reduce with list:
tib %>% group_by(grpvar) %>% summarise(aggr=list(reduce(col2,`+`)))
Output:
# A tibble: 2 x 2
grpvar aggr
<fct> <list>
1 a <dbl [10]>
2 b <dbl [10]>
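The same pattern can be checked on fixed numbers, where the result is easy to verify by hand. This is a minimal sketch: the short length-3 vectors are made up for illustration and replace the rnorm() draws from the question.

```r
library(dplyr)
library(purrr)
library(tibble)

# Made-up data: two rows per group, each holding a length-3 vector
tib <- tibble(
  grpvar = factor(c("a", "a", "b", "b")),
  col2   = list(c(1, 2, 3), c(4, 5, 6), c(10, 20, 30), c(1, 1, 1))
)

# list() keeps the element-wise sum as a single list-column entry per group
out <- tib %>%
  group_by(grpvar) %>%
  summarise(aggr = list(reduce(col2, `+`)))

out$aggr[[1]]  # 5 7 9
out$aggr[[2]]  # 11 21 31
```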

summarize to vector output

Let's say I have the following (simplified) tibble containing a group and values in vectors:
set.seed(1)
(tb_vec <- tibble(group = factor(rep(c("A","B"), c(2,3))),
values = replicate(5, sample(3), simplify = FALSE)))
# A tibble: 5 x 2
group values
<fct> <list>
1 A <int [3]>
2 A <int [3]>
3 B <int [3]>
4 B <int [3]>
5 B <int [3]>
tb_vec[[1,2]]
[1] 1 3 2
I would like to summarize the values vectors per group by summing them (vectorized) and tried the following:
tb_vec %>% group_by(group) %>%
summarize(vec_sum = colSums(purrr::reduce(values, rbind)))
Error: Column vec_sum must be length 1 (a summary value), not 3
The error surprises me, because tibbles (the output format) can contain vectors as well.
My expected output would be the following summarized tibble:
# A tibble: 2 x 2
group vec_sum
<fct> <list>
1 A <dbl [3]>
2 B <dbl [3]>
Is there a tidyverse solution to accommodate the vector output of summarize? I want to avoid splitting the tibble, because then I lose the factor.
You just need to add list(.) within summarise in your solution, in order to be able to have a column with 2 elements, where each element is a vector of 3 values:
library(tidyverse)
set.seed(1)
(tb_vec <- tibble(group = factor(rep(c("A","B"), c(2,3))),
values = replicate(5, sample(3), simplify = FALSE)))
tb_vec %>%
group_by(group) %>%
summarize(vec_sum = list(colSums(purrr::reduce(values, rbind)))) -> res
res$vec_sum
# [[1]]
# [1] 2 4 6
#
# [[2]]
# [1] 6 5 7
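A possible variation, sketched here on made-up data: reduce(values, `+`) adds the vectors element-wise directly, without going through rbind and colSums. It also copes with a group that has only one row, where reduce(values, rbind) would return a plain vector and colSums would fail.

```r
library(dplyr)
library(purrr)
library(tibble)

# Made-up data: group B has a single row on purpose
tb <- tibble(
  group  = factor(c("A", "A", "B")),
  values = list(c(1, 3, 2), c(1, 1, 4), c(2, 2, 2))
)

res <- tb %>%
  group_by(group) %>%
  summarize(vec_sum = list(reduce(values, `+`)))

res$vec_sum[[1]]  # 2 4 6
res$vec_sum[[2]]  # 2 2 2
```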

R sum a twice nested list using purrr

I have a data.frame with the following dimensions:
Output:
as_tibble(data2)
lamda meanlog sdlog freq freqsev
<dbl> <dbl> <dbl> <list> <list>
1 5 9 2 <int [4]> <list [4]>
2 2 10 2.1 <int [4]> <list [4]>
3 3 11 2.2 <int [4]> <list [4]>
where freqsev is a list of values of length freq, and freq itself is a list of values of length s, where s is the number of simulations.
library(tidyverse)
set.seed(123)
s <- 4
data <- data.frame(lamda = c(5, 2, 3), meanlog = c(9, 10, 11), sdlog = c(2, 2.1, 2.2))
data2 <- data %>% mutate(
freq = map(lamda, ~rpois(s, .x)),
freqsev = map(freq, ~map(.x, function(k) rlnorm(k, meanlog, sdlog)))
)
I would like to sum freqsev (producing <dbl [4]>, where 4 is the number of simulations s), i.e. a sum over the freq occurrences within each simulation, e.g.
for data2$freqsev[[1]][[1]] I would expect a single sum.
How can this be achieved? Thank you.
To be honest, this is a really complicated way of storing your data and you would probably be better off using unnest() after creating the freq column. However, you can get the sums of the freqsev vectors like this:
data2 <- data %>% mutate(
freq = map(lamda, ~rpois(s, .x)),
freqsev = map(freq, ~map(.x, function(k) rlnorm(k, meanlog, sdlog))),
freqsum = map(freqsev, ~map_dbl(.x, ~sum(.x)))
)
Because freqsev is a double-nested list, you also need to double-map the sum operation.
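The unnest() suggestion could look roughly like the sketch below, which keeps one row per lamda and simulation. The names params, long, sim, sev and by_lamda are made up for illustration. It differs from the code above in one respect: pmap passes each row's own meanlog and sdlog to rlnorm, whereas the inner function in the original appears to see the whole meanlog and sdlog columns.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(123)
s <- 5
params <- data.frame(lamda = c(5, 2, 3), meanlog = c(9, 10, 11), sdlog = c(2, 2.1, 2.2))

long <- params %>%
  mutate(freq = map(lamda, ~ rpois(s, .x))) %>%
  unnest(cols = freq) %>%                      # one row per (lamda, simulation)
  group_by(lamda) %>%
  mutate(sim = row_number()) %>%
  ungroup() %>%
  mutate(
    # draw freq severities per row, using that row's parameters
    sev     = pmap(list(freq, meanlog, sdlog), function(k, m, s_log) rlnorm(k, m, s_log)),
    freqsum = map_dbl(sev, sum)                # only a single map is needed now
  )

# Mean of the per-simulation sums for each lamda
by_lamda <- long %>%
  group_by(lamda) %>%
  summarise(mean_sum = mean(freqsum))
```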
