I am teaching myself the purrr package from the R tidyverse and am having trouble using map() on a column of nested data frames. Could someone explain what I'm missing?
Using the base R ChickWeight dataset as an example, I can easily get the number of observations for each time point under diet #1 if I first filter for diet #1, like so:
library(tidyverse)
ChickWeight %>%
filter(Diet == 1) %>%
group_by(Time) %>%
summarise(counts = n_distinct(Chick))
This is great but I would like to do it for each diet at once and I thought nesting the data and iterating over it with map() would be a good approach.
This is what I did:
example <- ChickWeight %>%
nest(-Diet)
Implementing this map function then achieves what I'm aiming for:
map(example$data, ~ .x %>% group_by(Time) %>% summarise(counts = n_distinct(Chick)))
However when I try and implement this same command using a pipe to put it in another column of the original data frame it fails.
example %>%
mutate(counts = map(data, ~ .x %>% group_by(Time) %>% summarise(counts = n_distinct(Chick))))
Error in eval(substitute(expr), envir, enclos) :
variable 'Chick' not found
Why does this occur?
I also tried it on the data frame split into a list and it didn't work.
ChickWeight %>%
split(.$Diet) %>%
map(data, ~ .x %>% group_by(Time) %>% summarise(counts = n_distinct(Chick)))
Because you're using dplyr non-standard evaluation (NSE) inside dplyr NSE, dplyr gets confused about which environment to search for Chick. It's probably a bug, really, but it can be avoided with the development version's new .data pronoun, which specifies where to look:
library(tidyverse)
ChickWeight %>%
nest(-Diet) %>%
mutate(counts = map(data,
~.x %>% group_by(Time) %>%
summarise(counts = n_distinct(.data$Chick))))
#> # A tibble: 4 × 3
#> Diet data counts
#> <fctr> <list> <list>
#> 1 1 <tibble [220 × 3]> <tibble [12 × 2]>
#> 2 2 <tibble [120 × 3]> <tibble [12 × 2]>
#> 3 3 <tibble [120 × 3]> <tibble [12 × 2]>
#> 4 4 <tibble [118 × 3]> <tibble [12 × 2]>
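As a possible follow-up, the nested counts can be flattened back into a single table with unnest(). This is only a sketch: it assumes a current tidyr (1.0 or later), where nest(-Diet) is written nest(data = -Diet), and the column name per_time is made up for the example.

```r
library(dplyr)
library(tidyr)
library(purrr)

# ChickWeight ships with base R (package datasets)
counts_by_diet <- as_tibble(ChickWeight) %>%
  nest(data = -Diet) %>%                       # one row per diet
  mutate(per_time = map(data, ~ .x %>%         # per_time: hypothetical name
    group_by(Time) %>%
    summarise(counts = n_distinct(.data$Chick)))) %>%
  select(Diet, per_time) %>%
  unnest(per_time)                             # back to one flat table

counts_by_diet
```

The result has one row per Diet and Time combination, which makes it easy to compare against a plain grouped summary.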
To pipe it through a list, omit the first argument of map(): the pipe already supplies the list to iterate over, so the formula is the only argument you pass.
ChickWeight %>%
split(.$Diet) %>%
map(~ .x %>% group_by(Time) %>% summarise(counts = n_distinct(Chick))) %>% .[[1]]
#> # A tibble: 12 × 2
#> Time counts
#> <dbl> <int>
#> 1 0 20
#> 2 2 20
#> 3 4 19
#> 4 6 19
#> 5 8 19
#> 6 10 19
#> 7 12 19
#> 8 14 18
#> 9 16 17
#> 10 18 17
#> 11 20 17
#> 12 21 16
A simpler option would be to just group by both columns:
ChickWeight %>% group_by(Diet, Time) %>% summarise(counts = n_distinct(Chick))
#> Source: local data frame [48 x 3]
#> Groups: Diet [?]
#>
#> Diet Time counts
#> <fctr> <dbl> <int>
#> 1 1 0 20
#> 2 1 2 20
#> 3 1 4 19
#> 4 1 6 19
#> 5 1 8 19
#> 6 1 10 19
#> 7 1 12 19
#> 8 1 14 18
#> 9 1 16 17
#> 10 1 18 17
#> # ... with 38 more rows
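Another way to write the same grouped count, shown here only as a sketch, is to drop duplicate chick rows first and then use count(). The name argument just labels the result column to match the output above.

```r
library(dplyr)

chick_counts <- ChickWeight %>%
  as_tibble() %>%
  distinct(Diet, Time, Chick) %>%          # one row per chick per time point
  count(Diet, Time, name = "counts")       # rows per Diet/Time = distinct chicks

chick_counts
```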
I'm trying to learn how to use nest(). I want to nest by which one of 3 time periods a participant is in, and I want to add two columns. The first column is the overall mean, which I have figured out. Then I want to nest by the time variable to create 3 datasets (which I have also figured out) and compute the group mean within each. I read that you should create a function (here, section 6.3.1), but my function keeps failing. How would I do this?
Also, please use nest or nest_by in the solution. I know I could use group_by(), like someone else did here, but in my actual data, I need these to be 3 separate datasets due to other computations that I need to do.
#Here's my setup and sample data
library(dplyr)
library(purrr)
library(tidyr)
set.seed(1414)
test <- tibble(id = c(1:100),
condition = c(rep(c("pre", "post"), 50)),
time = c(case_when(condition == "pre" ~ 0,
condition == "post" ~ sample(c(1, 2), size = c(100), replace = TRUE))),
score = case_when(time == 0 ~ 1,
time == 1 ~ 10,
time == 2 ~ 100))
#Here's what I tried
#Nesting the data (works)
nested_test <- test %>%
unite(col = "all_combos", c(condition, time)) %>%
mutate(score2 = mean(score)) %>%
nest_by(all_combos)
#Make mean function and map it (doesn't work)
my_mean <- function(data) {
mean(score, na.rm = T)
}
nested_test %>%
mutate(score3 = map(data, my_mean))
nest_by() returns a rowwise data frame, so we may need to ungroup() first. Then we can loop over the nested data with map() and add the new column to each nested tibble with mutate():
library(dplyr)
library(purrr)
nested_test_new <- nested_test %>%
ungroup %>%
mutate(data = map(data, ~ .x %>%
mutate(score3 = mean(score, na.rm = TRUE))))
-output
nested_test_new
# A tibble: 3 × 2
all_combos data
<chr> <list>
1 post_1 <tibble [19 × 4]>
2 post_2 <tibble [31 × 4]>
3 pre_0 <tibble [50 × 4]>
> nested_test_new$data
[[1]]
# A tibble: 19 × 4
id score score2 score3
<int> <dbl> <dbl> <dbl>
1 2 10 33.4 10
2 4 10 33.4 10
3 14 10 33.4 10
4 16 10 33.4 10
5 18 10 33.4 10
6 28 10 33.4 10
7 30 10 33.4 10
8 32 10 33.4 10
9 38 10 33.4 10
10 44 10 33.4 10
11 48 10 33.4 10
12 60 10 33.4 10
13 64 10 33.4 10
14 78 10 33.4 10
15 80 10 33.4 10
16 86 10 33.4 10
17 92 10 33.4 10
18 96 10 33.4 10
19 100 10 33.4 10
[[2]]
# A tibble: 31 × 4
id score score2 score3
<int> <dbl> <dbl> <dbl>
1 6 100 33.4 100
2 8 100 33.4 100
3 10 100 33.4 100
4 12 100 33.4 100
...
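If the goal is instead to keep a helper function like my_mean and get one mean per nested dataset, one possible fix is sketched below. The original function fails because score is not a variable in the function's scope; it has to be looked up inside the data frame that map() passes in. The toy data and its values are invented for illustration.

```r
library(dplyr)
library(purrr)
library(tidyr)

# Small stand-in for the question's data (values are illustrative only)
toy <- tibble(
  all_combos = c("pre_0", "pre_0", "post_1", "post_1", "post_2"),
  score      = c(1, 1, 10, 10, 100)
)

# Look the column up inside the data frame the function receives
my_mean <- function(data) mean(data$score, na.rm = TRUE)

nested_toy <- toy %>%
  nest(data = -all_combos) %>%                 # plain (not rowwise) nesting
  mutate(score3 = map_dbl(data, my_mean))      # one number per nested tibble

nested_toy
```

Using map_dbl() rather than map() gives a plain numeric column instead of a list-column.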
Or another option is nest_mutate from nplyr
library(nplyr)
test %>%
unite(col = "all_combos", c(condition, time)) %>%
mutate(score2 = mean(score)) %>%
nest(data = -all_combos) %>%
nest_mutate(data, score3 = mean(score, na.rm = TRUE))
-output
# A tibble: 3 × 2
all_combos data
<chr> <list>
1 pre_0 <tibble [50 × 4]>
2 post_1 <tibble [19 × 4]>
3 post_2 <tibble [31 × 4]>
Nesting normally groups by rows. My case is different.
I want to create a nested tibble grouping by column prefixes (before the first '_'), preserving the original column names in the nested tibbles.
The current approach works but looks overcomplicated.
tibble(a_1=1:3, a_2=2:4, b_1=3:5) %>%
print() %>%
# A tibble: 3 x 3
# a_1 a_2 b_1
# <int> <int> <int>
# 1 1 2 3
# 2 2 3 4
# 3 3 4 5
pivot_longer(everything()) %>%
nest(data=-name) %>%
mutate(data=map2(data, name, ~rename(.x, '{.y}' := value))) %>%
mutate(gr=str_extract(name, '^[^_]+'), .keep='unused') %>%
nest(data=-gr) %>%
mutate(data=map(data, ~bind_cols(.[[1]]))) %>%
print() %>%
# A tibble: 2 x 2
# gr data
# <chr> <list>
# 1 a <tibble [3 x 2]>
# 2 b <tibble [3 x 1]>
{ .$data[[1]] }
# A tibble: 3 x 2
# a_1 a_2
# <int> <int>
# 1 1 2
# 2 2 3
# 3 3 4
UPD: if possible, tidyverse solution
Using a neat little trick I learned lately you could do:
library(tidyr)
library(dplyr, warn = FALSE)
tibble(a_1 = 1:3, a_2 = 2:4, b_1 = 3:5) %>%
split.default(., gsub("_[0-9]", "", names(.))) %>%
lapply(nest, data = everything()) %>%
bind_rows(.id = "gr")
#> # A tibble: 2 × 2
#> gr data
#> <chr> <list>
#> 1 a <tibble [3 × 2]>
#> 2 b <tibble [3 × 1]>
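To see what the split.default() step does on its own, here is a small base R sketch. A data frame is a list of columns, so split.default() splits the columns (not the rows) by the grouping vector. The prefix pattern used here is an assumption that matches the example column names.

```r
df <- data.frame(a_1 = 1:3, a_2 = 2:4, b_1 = 3:5)

# Everything before the first underscore, one entry per column
prefix <- sub("_.*$", "", names(df))

# Splits the columns into one data frame per prefix
parts <- split.default(df, prefix)

names(parts)      # group names
names(parts$a)    # original column names are preserved
```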
Another possible solution, based on purrr::map_dfr:
library(tidyverse)
# df is the example tibble from the question
df <- tibble(a_1 = 1:3, a_2 = 2:4, b_1 = 3:5)
map_dfr(unique(str_remove(names(df), "_\\d+")),
        ~ tibble(gr = .x, nest(select(df, which(str_detect(names(df), .x))),
                               data = everything())))
#> # A tibble: 2 × 2
#> gr data
#> <chr> <list>
#> 1 a <tibble [3 × 2]>
#> 2 b <tibble [3 × 1]>
My version: a slightly modified, more tidyverse-style take on stepan's answer.
tibble(a_1 = 1:3, a_2 = 2:4, b_1 = 3:5) %>%
split.default(str_extract(names(.), "^[^_]+")) %>%
map(nest, data = everything()) %>%
bind_rows(.id = "gr")
I couldn't find a tidyverse alternative to split.default().
I am a beginner in programming. I have a table in which each row is an order (variables: id_customer and date). I want to write a function that calculates, for each month, the number of customers who have made an order within 7 days. How can I do this?
This is the output of my data :
id_customer  jour_commande
7            12-05-2021
10           13-07-2021
18           17-07-2021
I have tried this, but it only computes the time difference between two consecutive orders for each customer:
data %>%
arrange(id_customer,jour_Commande) %>%
mutate(diff = jour_Commande - lag(jour_Commande)) %>%
group_by(id_customer,jour_Commande)
For the first customer it works well, but for the others I get negative numbers.
Can someone help me with this?
Thanks in advance!
Here's an attempt. I'm using the storms data set from the dplyr package as a substitute because it's too much work to type in all the data in the screenshot.
library(lubridate)
library(dplyr)
mydat <- storms %>%
mutate(date = ymd(paste(year, month, day, sep = "-"))) %>%
select(name, status, date) %>%
distinct(name, status, .keep_all = TRUE)
mydat
# A tibble: 513 x 3
name status date
<chr> <chr> <date>
1 Amy tropical depression 1975-06-27
2 Amy tropical storm 1975-06-29
3 Caroline tropical depression 1975-08-24
4 Caroline tropical storm 1975-08-29
5 Caroline hurricane 1975-08-30
6 Doris tropical storm 1975-08-29
7 Doris hurricane 1975-08-31
8 Belle tropical depression 1976-08-06
9 Belle tropical storm 1976-08-07
10 Belle hurricane 1976-08-07
# ... with 503 more rows
mydat %>%
arrange(name, date) %>%
group_by(name) %>%
mutate(diff = date - lag(date),
within_7 = if_else(diff <= 7, TRUE, FALSE)) %>%
ungroup() %>%
filter(!is.na(within_7)) %>%
count(within_7)
# A tibble: 2 x 2
within_7 n
<lgl> <int>
1 FALSE 50
2 TRUE 249
EDIT: Looping through month-by-month
library(tidyr)
library(purrr)
mydat %>%
mutate(month = month(date),
year = year(date)) %>%
group_by(year, month) %>%
nest() %>%
mutate(count_data = map(data, ~ .x %>%
arrange(name, date) %>%
group_by(name) %>%
mutate(diff = date - lag(date),
within_7 = if_else(diff <= 7, TRUE, FALSE)) %>%
ungroup() %>%
filter(!is.na(within_7)) %>%
count(within_7))) %>%
unnest(count_data) %>%
ungroup()
# A tibble: 98 x 5
month year data within_7 n
<dbl> <dbl> <list> <lgl> <int>
1 6 1975 <tibble [2 x 3]> TRUE 1
2 8 1975 <tibble [5 x 3]> TRUE 3
3 8 1976 <tibble [3 x 3]> TRUE 2
4 9 1976 <tibble [3 x 3]> TRUE 2
5 8 1977 <tibble [3 x 3]> TRUE 2
6 9 1977 <tibble [3 x 3]> TRUE 2
7 10 1977 <tibble [3 x 3]> TRUE 2
8 7 1978 <tibble [2 x 3]> TRUE 1
9 8 1978 <tibble [5 x 3]> TRUE 3
10 10 1978 <tibble [2 x 3]> TRUE 1
# ... with 88 more rows
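Applied to the column names from the question, the same idea might look like the sketch below. The negative differences in the question come from calling lag() before grouping by customer, so each customer's first order is compared with the previous customer's last one. The extra order rows are invented for illustration, and the dates are assumed to be day-month-year strings.

```r
library(dplyr)
library(lubridate)

# Illustrative orders; only the column names come from the question
orders <- tibble(
  id_customer   = c(7, 7, 10, 10, 18),
  jour_commande = c("12-05-2021", "15-05-2021",
                    "13-07-2021", "30-07-2021", "17-07-2021")
)

monthly <- orders %>%
  mutate(jour_commande = dmy(jour_commande)) %>%    # parse day-month-year
  arrange(id_customer, jour_commande) %>%
  group_by(id_customer) %>%                         # lag() within each customer
  mutate(diff = as.numeric(jour_commande - lag(jour_commande),
                           units = "days")) %>%
  ungroup() %>%
  filter(!is.na(diff), diff <= 7) %>%               # repeat orders within 7 days
  mutate(month = floor_date(jour_commande, "month")) %>%
  group_by(month) %>%
  summarise(customers = n_distinct(id_customer))

monthly
```

Here the month is taken from the later of the two orders; that is a choice made for the sketch, not something the question specifies.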
I have a dataset with daily counts per year spanning several decades, and I'd like to run a function on different subsets of that data based on an increasing timespan. For example, I'd like to run the function on the first decade of data (1995-2005), then on the first decade + 1 (1995-2006), first decade + 2 (1995-2007), and so on until the end of the time series. This is what I had in mind:
dat <- tibble(
year = rep(1995:2014, each = 30),
count = rpois(600, 5)
)
dat
# A tibble: 600 x 2
year count
<int> <int>
1 1995 8
2 1995 3
3 1995 9
4 1995 2
5 1995 8
6 1995 7
7 1995 3
8 1995 6
9 1995 1
10 1995 7
# … with 590 more rows
with the final product looking like this:
# A tibble: 3 x 2
time_span data
<chr> <list>
1 1995-2004 <tibble [300 × 1]>
2 1995-2005 <tibble [330 × 1]>
3 1995-2006 <tibble [360 × 1]>
...
I would then apply my function to the nested data frame:
dat_nested %>%
mutate(result = map(data, my_function))
I'm struggling to think of a way to create these subsets with dplyr...any suggestions? Thanks!
Here's a way using purrr::map_df():
library(dplyr)
n <- min(dat$year)
purrr::map_df((n+10):max(dat$year),
~dat %>%
filter(between(year, n, .x)) %>%
summarise(year = paste(min(year), max(year), sep = '-'),
data = list(count)))
#If you want dataframe
#data = list(data.frame(count = count))))
# year data
# <chr> <list>
# 1 1995-2005 <int [330]>
# 2 1995-2006 <int [360]>
# 3 1995-2007 <int [390]>
# 4 1995-2008 <int [420]>
# 5 1995-2009 <int [450]>
# 6 1995-2010 <int [480]>
# 7 1995-2011 <int [510]>
# 8 1995-2012 <int [540]>
# 9 1995-2013 <int [570]>
#10 1995-2014 <int [600]>
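To get the exact shape asked for in the question (a time_span column plus a list-column of tibbles, starting at 1995-2004), a variation along these lines may work. It is a sketch: my_function below is a placeholder for the asker's real function, and the seed is only there to make the toy data reproducible.

```r
library(dplyr)
library(purrr)
library(tibble)

set.seed(123)
dat <- tibble(year = rep(1995:2014, each = 30), count = rpois(600, 5))

first_year <- min(dat$year)
end_years  <- (first_year + 9):max(dat$year)   # 2004, 2005, ..., 2014

# One row per expanding window, each holding the rows up to its end year
dat_nested <- tibble(
  time_span = paste(first_year, end_years, sep = "-"),
  data      = map(end_years, ~ dat %>% filter(year <= .x) %>% select(count))
)

my_function <- function(d) sum(d$count)        # placeholder function

res <- dat_nested %>%
  mutate(result = map_dbl(data, my_function))
```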
The result can be calculated directly from the original data frame, without an intermediate nested data frame, as shown below. If you do want a nested data frame anyway, use the same code with
my_function <- base::list
to nest the two columns or with
my_function <- function(x) list(x["count"])
to just nest the count column. The solution only uses dplyr. It does not use tidyr or purrr.
library(dplyr)
my_function <- function(x) sum(x$count) # test function
dat %>%
group_by(year) %>%
summarize(result = my_function(.[.$year <= first(year), ]), .groups = "drop") %>%
mutate(year = paste(first(year), year, sep = "-")) %>%
tail(-9)
giving:
# A tibble: 11 x 2
year result
<chr> <int>
1 1995-2004 1502
2 1995-2005 1647
3 1995-2006 1810
4 1995-2007 1957
5 1995-2008 2106
6 1995-2009 2258
7 1995-2010 2398
8 1995-2011 2547
9 1995-2012 2697
10 1995-2013 2855
11 1995-2014 3016
With my_function <- function(x) list(x["count"]) the output looks like this:
# A tibble: 11 x 2
year result
<chr> <list>
1 1995-2004 <tibble [300 x 1]>
2 1995-2005 <tibble [330 x 1]>
3 1995-2006 <tibble [360 x 1]>
4 1995-2007 <tibble [390 x 1]>
5 1995-2008 <tibble [420 x 1]>
6 1995-2009 <tibble [450 x 1]>
7 1995-2010 <tibble [480 x 1]>
8 1995-2011 <tibble [510 x 1]>
9 1995-2012 <tibble [540 x 1]>
10 1995-2013 <tibble [570 x 1]>
11 1995-2014 <tibble [600 x 1]>
Note
The test input dat in reproducible form is:
set.seed(123)
dat <- data.frame(year = rep(1995:2014, each = 30), count = rpois(600, 5))
Here is my attempt to create a nested data with time-series data on a rolling window basis. (note: rlang usage of var=enquo(str_varname) with !!var may change in the future versions.)
library(dplyr)
library(tidyr)
create_rolling_yr_data <- function(df, year='year', rolling=9,
                                   var_list=c('count'), newvar='rolling') {
  year <- enquo(year)
  var_list <- enquos(var_list)
  df <- df %>% dplyr::select(!!year, !!!var_list)
  df_nest <- df %>% group_by(year) %>% nest()
  print(df_nest)
  list_data <- list()
  # The year column is referred to as `year` here, as in group_by() and filter()
  yrs <- unique(df$year)
  yr_end <- max(yrs) - rolling
  for (i in seq_along(yrs)) {
    yr <- yrs[i]
    if (yr <= yr_end) {
      # rows in the window [yr, yr + rolling]
      list_data[[i]] <- df %>% filter(year >= yr, year <= (yr+rolling))
    } else {
      list_data[[i]] <- list()
    }
  }
  df_nest[[newvar]] <- list_data
  return(df_nest %>% filter(year <= yr_end))
}
create_rolling_yr_data(dat, year='year', rolling=9,
var_list=c('count'), newvar='rolling')
So, I've checked multiple posts and haven't found anything. According to this, my code should work, but it isn't.
Objective: I want to essentially print out the number of subjects--which in this case is also the number of rows in this tibble.
Code:
data<-read.csv("advanced_r_programming/data/MIE.csv")
make_LD<-function(x){
LongitudinalData<-x%>%
group_by(id)%>%
nest()
structure(list(LongitudinalData), class = "LongitudinalData")
}
print.LongitudinalData<-function(x){
paste("Longitudinal dataset with", x[["id"]], "subjects")
}
x<-make_LD(data)
print(x)
Here's the head of the dataset I'm working on:
> head(x)
[[1]]
# A tibble: 10 x 2
id data
<int> <list>
1 14 <tibble [11,945 x 4]>
2 20 <tibble [11,497 x 4]>
3 41 <tibble [11,636 x 4]>
4 44 <tibble [13,104 x 4]>
5 46 <tibble [13,812 x 4]>
6 54 <tibble [10,944 x 4]>
7 64 <tibble [11,367 x 4]>
8 74 <tibble [11,517 x 4]>
9 104 <tibble [11,232 x 4]>
10 106 <tibble [13,823 x 4]>
Output:
[1] "Longitudinal dataset with subjects"
I've tried every possible combination from the aforementioned stackoverflow post and none seem to work.
Here are two options:
library(tidyverse)
# Create a nested data frame
dat = mtcars %>%
  group_by(cyl) %>%
  nest() %>% as_tibble()
cyl data
1 6 <tibble [7 x 10]>
2 4 <tibble [11 x 10]>
3 8 <tibble [14 x 10]>
dat %>%
mutate(nrow=map_dbl(data, nrow))
dat %>%
group_by(cyl) %>%
mutate(nrow = nrow(data.frame(data)))
cyl data nrow
1 6 <tibble [7 x 10]> 7
2 4 <tibble [11 x 10]> 11
3 8 <tibble [14 x 10]> 14
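Coming back to the S3 class in the question: print(x) shows no number because make_LD() wraps the nested tibble in an unnamed list, so x[["id"]] is NULL. One possible fix, sketched here with an invented toy data frame and an assumed element name data, is to store the tibble under a name and count its rows in the print method.

```r
library(dplyr)
library(tidyr)

make_LD <- function(x) {
  nested <- x %>% group_by(id) %>% nest() %>% ungroup()
  # Keep the nested tibble under a name so methods can find it
  structure(list(data = nested), class = "LongitudinalData")
}

print.LongitudinalData <- function(x, ...) {
  # One row per subject after nesting by id
  cat("Longitudinal dataset with", nrow(x$data), "subjects\n")
  invisible(x)
}

# Toy input; the real data come from MIE.csv in the question
toy <- data.frame(id = c(14, 14, 20, 41), value = c(1.2, 3.4, 5.6, 7.8))
ld <- make_LD(toy)
print(ld)
```

Using cat() rather than paste() matters here: a print method should write to the console, whereas paste() only returns a string.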
There is a specific function for this in the tidyverse: n()
You can simply do: mtcars %>% group_by(cyl) %>% summarise(rows = n())
> mtcars %>% group_by(cyl) %>% summarise(rows = n())
# A tibble: 3 x 2
cyl rows
<dbl> <int>
1 4 11
2 6 7
3 8 14
In more sophisticated cases, in which subjects may span across multiple rows ("long format data"), you can do (assuming hp denotes the subject):
> mtcars %>% group_by(cyl, hp) %>% #always group by subject-ID last
+ summarise(n = n()) %>% #observations per subject and cyl
+ summarise(n = n()) #subjects per cyl (implicitly summarises across all group-variables except the last)
`summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
# A tibble: 3 x 2
cyl n
<dbl> <int>
1 4 10
2 6 4
3 8 9
Note that the n in the last case is smaller than in the first because cars with the same cyl and hp values are now counted as just one "subject".
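The two-step summarise can also be written in one step with n_distinct(), which counts the distinct subject IDs per group directly. As above, hp stands in for the subject ID; this is a sketch of an equivalent formulation, not a replacement for the approach shown.

```r
library(dplyr)

subjects_per_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(n = n_distinct(hp))   # distinct "subjects" (hp values) per cyl

subjects_per_cyl
```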