purrr::map equivalent to dplyr::do

Reading https://twitter.com/hadleywickham/status/719542847045636096 I understand that the purrr approach should basically replace do.
Hence, I was wondering how I would use purrr to do this:
library(dplyr)
d <- data_frame(n = 1:3)
d %>% rowwise() %>% do(data_frame(x = seq_len(.$n))) %>% ungroup()
# # tibble [6 x 1]
# x
# * <int>
# 1 1
# 2 1
# 3 2
# 4 1
# 5 2
# 6 3
The closest I could get was something like:
library(purrr)
d %>% mutate(x = map(n, seq_len))
# # A tibble: 3 x 2
# n x
# <int> <list>
# 1 1 <int [1]>
# 2 2 <int [2]>
# 3 3 <int [3]>
map_int would not work, because each element has a different length. So what is the purrr way of doing it?

You could do the following:
library(tidyverse)
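library(purrrlyr) # by_row() is provided by purrrlyr in current releases; older purrr versions exported it directly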
d %>% by_row(~data_frame(x = seq_len(.$n))) %>% unnest()
by_row applies a function to each row, storing the result in nested tibbles.
unnest is then used to remove the nesting and to concatenate the tibbles.

Using pmap() removes the need for nesting and unnesting.
library(tidyverse)
d %>% pmap_df(~data_frame(x = seq_len(.)))
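For what it's worth, data_frame() is deprecated in current tidyverse releases in favour of tibble(), and the list column from the question can simply be unnested. A minimal sketch, assuming dplyr, purrr and tidyr (>= 1.0, for the cols argument) are installed; the name res is only illustrative:

```r
library(dplyr)
library(purrr)
library(tidyr)

d <- tibble(n = 1:3)

# Build the list column as in the question, then flatten it row by row
res <- d %>%
  mutate(x = map(n, seq_len)) %>%
  unnest(cols = x) %>%
  select(x)

res$x
```

This keeps the intermediate list column visible, which can help when debugging; drop the select() call if you also want n in the result.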

Related

How to extract first value from lists in data.frames columns?

This question is similar to R: How to extract a list from a dataframe?
But I could not implement it to my question in an easy way.
weird_df <- data_frame(col1 =c('hello', 'world', 'again'),col_weird = list(list(12,23), list(23,24), NA),col_weird2 = list(list(0,45), list(4,45),list(45,45.45,23)))
weird_df
# A tibble: 3 x 3
col1 col_weird col_weird2
<chr> <list> <list>
1 hello <list [2]> <list [2]>
2 world <list [2]> <list [2]>
3 again <lgl [1]> <list [3]>
In the columns col_weird and col_weird2, I want to display only the first value of each list:
col1 col_weird col_weird2
1 hello 12 0
2 world 23 4
3 again NA 45
My real problem has a lot of columns. I tried this (adapted from the accepted answer in the linked question):
library(tidyr)
library(purrr)
weird_df %>%
mutate(col_weird = map(c(col_weird,col_weird2), toString ) ) %>%
separate(col_weird, into = c("col1"), convert = TRUE) %>%
separate(col_weird2, into = c("col2"), convert = TRUE)

One solution would be to write a simple function that extracts the first value from each list in a vector of lists. You can then apply this function to the relevant columns in your data frame.
library(tibble)
#create data
weird_df <- tibble(col1 =c('hello', 'world', 'again'),
col_weird = list(list(12,23), list(23,24), NA),
col_weird2 = list(list(0,45), list(4,45), list(45,45.45,23)))
#function to extract first values from a vector of lists
fnc <- function(x) {
sapply(x, FUN = function(y) {y[[1]]})
}
#apply function to the relevant columns
weird_df[,2:3] <- apply(weird_df[,2:3], MARGIN = 2, FUN = fnc)
weird_df
# A tibble: 3 x 3
col1 col_weird col_weird2
<chr> <dbl> <dbl>
1 hello 12 0
2 world 23 4
3 again NA 45

Here is a dplyr solution
library(dplyr)
weird_df %>% mutate(across(c(col_weird, col_weird2), ~vapply(., `[[`, numeric(1L), 1L)))
Output
# A tibble: 3 x 3
col1 col_weird col_weird2
<chr> <dbl> <dbl>
1 hello 12 0
2 world 23 4
3 again NA 45
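Since the question mentions many columns, here is a sketch that selects every list column instead of naming them. It assumes dplyr >= 1.0 (for across() and where()) and purrr; first_df is just an illustrative name, and it assumes every list column holds numeric values or NA:

```r
library(dplyr)
library(purrr)

weird_df <- tibble(
  col1 = c("hello", "world", "again"),
  col_weird = list(list(12, 23), list(23, 24), NA),
  col_weird2 = list(list(0, 45), list(4, 45), list(45, 45.45, 23))
)

# Take the first element of every entry in every list column;
# as.numeric() turns the logical NA into NA_real_ so map_dbl() accepts it
first_df <- weird_df %>%
  mutate(across(where(is.list),
                ~ map_dbl(.x, function(el) as.numeric(el[[1]]))))

first_df
```

If some list columns hold non-numeric values, this would need a different map_*() variant per column type.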

summarize to vector output

Let's say I have the following (simplified) tibble containing a group and values in vectors:
set.seed(1)
(tb_vec <- tibble(group = factor(rep(c("A","B"), c(2,3))),
values = replicate(5, sample(3), simplify = FALSE)))
# A tibble: 5 x 2
group values
<fct> <list>
1 A <int [3]>
2 A <int [3]>
3 B <int [3]>
4 B <int [3]>
5 B <int [3]>
tb_vec[[1,2]]
[1] 1 3 2
I would like to summarize the values vectors per group by summing them (vectorized) and tried the following:
tb_vec %>% group_by(group) %>%
summarize(vec_sum = colSums(purrr::reduce(values, rbind)))
Error: Column vec_sum must be length 1 (a summary value), not 3
The error surprises me, because tibbles (the output format) can contain vectors as well.
My expected output would be the following summarized tibble:
# A tibble: 2 x 2
group vec_sum
<fct> <list>
1 A <dbl [3]>
2 B <dbl [3]>
Is there a tidyverse solution to accommodate the vector output of summarize? I want to avoid splitting the tibble, because then I lose the factor.

You just need to add list(.) within summarise in your solution, in order to be able to have a column with 2 elements, where each element is a vector of 3 values:
library(tidyverse)
set.seed(1)
(tb_vec <- tibble(group = factor(rep(c("A","B"), c(2,3))),
values = replicate(5, sample(3), simplify = FALSE)))
tb_vec %>%
group_by(group) %>%
summarize(vec_sum = list(colSums(purrr::reduce(values, rbind)))) -> res
res$vec_sum
# [[1]]
# [1] 2 4 6
#
# [[2]]
# [1] 6 5 7
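As a variation, the rbind/colSums step can be swapped for an element-wise reduce with `+`, which also copes with groups that contain a single row (there reduce(values, rbind) returns a plain vector and colSums fails). A small sketch on made-up fixed data, so the result does not depend on the random seed; tb_fixed and res_fixed are illustrative names only:

```r
library(dplyr)
library(purrr)

# Hypothetical fixed data with the same shape as tb_vec
tb_fixed <- tibble(
  group = factor(c("A", "A", "B", "B", "B")),
  values = list(c(1, 2, 3), c(3, 1, 2), c(2, 3, 1), c(1, 1, 1), c(0, 0, 1))
)

# Sum the vectors of each group element-wise and wrap the result in a list
res_fixed <- tb_fixed %>%
  group_by(group) %>%
  summarize(vec_sum = list(reduce(values, `+`)))

res_fixed$vec_sum
```

The group column stays a factor, which was the reason for not splitting the tibble.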

R collapse column to form numeric list

In R, how do I collapse a column to form another column of numeric list type,
like a numeric vector defined as l = c(1,2,3)?
df <- read.table(text = "X Y
a 26
a 3
a 24
b 8
b 1
b 4
", header = TRUE)
I am trying this with dplyr, but it gives me a character column:
> df %>% group_by(X) %>% summarise(lst= paste0(Y, collapse = ","))
# A tibble: 2 x 2
X lst
<fct> <chr>
1 a 26,3,24
2 b 8,1,4

Group by X, then summarise Y as a list:
library(dplyr)
out <- df %>%
group_by(X) %>%
summarise(Y = list(Y))
out
# A tibble: 2 x 2
# X Y
# <fct> <list>
#1 a <int [3]>
#2 b <int [3]>
The Y column now looks like this
out$Y
#[[1]]
#[1] 26 3 24
#
#[[2]]
#[1] 8 1 4
nest seems to be another option but this would result in a list column of tibbles (not what you want I think)
df %>%
group_by(X) %>%
nest()
# A tibble: 2 x 2
# X data
# <fct> <list>
#1 a <tibble [3 × 1]>
#2 b <tibble [3 × 1]>
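To check that the collapsed column really holds numeric vectors rather than strings, you can compute on it directly. A minimal sketch, assuming purrr is available alongside dplyr; the column names total and n and the object checked are made up for illustration:

```r
library(dplyr)
library(purrr)

df <- read.table(text = "X Y
a 26
a 3
a 24
b 8
b 1
b 4
", header = TRUE)

out <- df %>%
  group_by(X) %>%
  summarise(Y = list(Y))

# Each element is a plain integer vector, so numeric functions apply directly
checked <- out %>%
  mutate(total = map_dbl(Y, sum), n = lengths(Y))

checked
```

With the paste0() version from the question, the same sum would fail because each element would be a single string.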

A data.table solution:
library(data.table)
dt <- as.data.table(df)[, list(Y=list(Y)), by="X"]
> dt
X Y
1: a 26, 3,24
2: b 8,1,4
> dt$Y
[[1]]
[1] 26 3 24
[[2]]
[1] 8 1 4

filter list variable in dplyr

In general how do we filter by a list variable in dplyr?
E.g. a data frame where one variable is a list of different classes of object:
aa <- tibble(ss = c(1,2),
dd = list(NA,
matrix(data = c(1,2,3,4),
nrow = 2,
ncol = 2)))
> aa
# A tibble: 2 x 2
# ss dd
# <dbl> <list>
#1 1.00 <lgl [1]>
#2 2.00 <dbl [2 × 2]>
For example if I want to filter for logicals (though could be anything), if it were not a list it would be as simple as:
aa %>% filter(is.logical(dd))
But this returns
# A tibble: 0 x 2
# ... with 2 variables: ss <dbl>, dd <list>
Because it's not the first element that's a logical, it's the first element of the first element:
> is.logical(aa$dd[1])
# [1] FALSE
> is.logical(aa$dd[[1]])
# [1] TRUE
One may use purrr::map for other operations on nested list variables, but this also doesn't work.
> aa %>% filter(map(.x = dd,
+ .f = is.logical))
# Error in filter_impl(.data, quo) : basic_string::resize
What am I missing here?

Since 'dd' is a list column, we can loop over it with map. Each element of 'dd' can itself contain more than one value, so we keep only the rows where all of those values are NA:
library(tidyverse)
aa %>%
filter(map_lgl(dd, ~ .x %>%
is.na %>%
all))
# A tibble: 1 x 2
# ss dd
# <dbl> <list>
#1 1 <lgl [1]>
If this is about filtering based on class.
aa %>%
filter(map_lgl(dd, is.logical))
# A tibble: 1 x 2
# ss dd
# <dbl> <list>
#1 1 <lgl [1]>
In the OP's code, the map output is still a list; map_lgl converts it to a logical vector, which is what filter expects.
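To make the difference concrete, here is a short sketch (assuming dplyr and purrr) that contrasts the column-level test with the per-element test, and then reuses the same pattern with another predicate. The is.matrix filter and the name mats are only an illustration, not part of the question:

```r
library(dplyr)
library(purrr)

aa <- tibble(ss = c(1, 2),
             dd = list(NA, matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)))

is.logical(aa$dd)            # a single value for the whole list column
map_lgl(aa$dd, is.logical)   # one value per row, which is what filter() needs

# Any per-element predicate works the same way, e.g. keeping the matrices
mats <- aa %>% filter(map_lgl(dd, is.matrix))
mats
```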

The best I can do is to create a dummy variable using is.logical with purrr::map, unlist it, filter by it, then un-select the dummy variable. It works, but what a kerfuffle.
aa %>%
mutate(ff = map(.x = dd,
.f = is.logical),
ff = unlist(ff)) %>%
filter(ff == TRUE) %>%
select(-ff)
# A tibble: 1 x 2
# ss dd
# <dbl> <list>
# 1 1.00 <lgl [1]>

dplyr: how to avoid hard coding variable names when I need them all?

Here is a simple example. The variables are only three, but could be many more. I would like a replacement for every c(X1,X2,X3) but can't find one.
library(dplyr)
library(MASS)
df <- data.frame(expand.grid(data.frame(matrix(rep(1:7,3),ncol=3))))
df1 <- df %>%
rowwise() %>%
filter(length(unique(c(X1,X2,X3)))==3)
df1 %>%
rowwise() %>%
filter(max(c(X1,X2,X3))- min(c(X1,X2,X3)) == 2) %>%
ungroup() %>%
summarise(res = n()/ nrow(df1)) %>%
unlist %>%
as.fractions

It really seems like everything() (newly fully exported) should do the trick, but it doesn't. Especially if you're going to be doing a lot of operations on all your columns, it may be worth it to make a list column with a vector of each row, on which you can easily call unique, max, etc. Here assembled with purrr, though you could do the same with apply(df, 1, list) %>% lapply(unlist):
library(purrr)
df1 <- df %>%
mutate(data = df %>% transpose() %>% map(unlist)) %>%
rowwise() %>%
filter(length(unique(data)) == 3)
df1
# Source: local data frame [210 x 4]
# Groups: <by row>
#
# X1 X2 X3 data
# <int> <int> <int> <list>
# 1 3 2 1 <int [3]>
# 2 4 2 1 <int [3]>
# 3 5 2 1 <int [3]>
# 4 6 2 1 <int [3]>
# 5 7 2 1 <int [3]>
# 6 2 3 1 <int [3]>
# 7 4 3 1 <int [3]>
# 8 5 3 1 <int [3]>
# 9 6 3 1 <int [3]>
# 10 7 3 1 <int [3]>
# .. ... ... ... ...
df1 %>%
rowwise() %>%
filter(max(data) - min(data) == 2) %>%
ungroup() %>%
summarise(res = n() / nrow(df1)) %>%
unlist %>%
as.fractions()
# res
# 1/7
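In dplyr 1.0 and later, c_across() lets everything() work inside rowwise operations, so the list column is no longer strictly needed. A sketch under that assumption (dplyr >= 1.0); the grid is rebuilt with expand.grid() directly and the fraction is left as a plain number rather than passed to as.fractions:

```r
library(dplyr)

df <- expand.grid(X1 = 1:7, X2 = 1:7, X3 = 1:7)

# Keep the rows whose three values are all different, without naming columns
df1 <- df %>%
  rowwise() %>%
  filter(n_distinct(c_across(everything())) == 3) %>%
  ungroup()

# Share of those rows whose range (max - min) is exactly 2
res <- df1 %>%
  rowwise() %>%
  filter(max(c_across(everything())) - min(c_across(everything())) == 2) %>%
  ungroup() %>%
  summarise(res = n() / nrow(df1))

res
```

Row-wise evaluation is slow on large data, so for many rows the pmax/pmin approach may still be preferable.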

We can do this also with data.table
library(data.table)
res <- setDT(df)[df[ ,uniqueN(unlist(.SD))==3 , 1:nrow(df)]$V1][,
sum(do.call(pmax, .SD)- do.call(pmin, .SD) ==2)/.N]
as.fractions(res)
#[1] 1/7
If we need to use dplyr
library(dplyr)
df1 <- df %>%
rowwise() %>%
do(data.frame(.,i1= n_distinct(unlist(.))==3)) %>%
filter(i1) %>%
dplyr::select(-i1)
df1 %>%
do(data.frame(., i2 = do.call(pmax, .) - do.call(pmin, .) == 2)) %>%
filter(i2) %>%
ungroup() %>%
summarise(n = n()/nrow(df1)) %>%
unlist %>%
as.fractions
# n
#1/7
