In the past, when I've needed to create a new variable in an R data frame that is partly based on a 'group_by' summary statistic, I've always used the following sequence:
(1) calculate 'group stats' from data in the base (ungrouped) data frame using group_by() and summarize()
(2) join the base data frame with the result of the previous step, then calculate the new variable value using mutate.
However (after years of using dplyr!), I accidentally did the 'summarizing' in a mutate step and everything seemed to work; this is illustrated as Option #2 in the code snippet below. I'm assuming Option #2 is okay because I'm getting identical results with both options, and because I found similar examples while searching the web today, but I wasn't sure.
Is Option #2 acceptable practice, or is Option #1 preferred (and if so why)?
library(dplyr)

set.seed(123)
df <- tibble(year_ = c(rep(2019, 4), rep(2020, 4)),
             qtr_ = rep(c(1, 2, 3, 4), 2),
             foo = sample(1:8))
# Option 1: calc statistics then rejoin with input data
df_stats <- df %>%
  group_by(year_) %>%
  summarize(mean_foo = mean(foo))
df_with_stats <- left_join(df, df_stats) %>%
  mutate(dfoo = foo - mean_foo)
# Option 2: everything in one go
df_with_stats2 <- df %>%
  group_by(year_) %>%
  mutate(mean_foo = mean(foo),
         dfoo = foo - mean_foo)
df_with_stats
# A tibble: 8 x 5
year_ qtr_ foo mean_foo dfoo
<dbl> <dbl> <int> <dbl> <dbl>
1 2019 1 7 6 1
2 2019 2 8 6 2
3 2019 3 3 6 -3
4 2019 4 6 6 0
5 2020 1 2 3 -1
6 2020 2 4 3 1
7 2020 3 5 3 2
8 2020 4 1 3 -2
df_with_stats2
# A tibble: 8 x 5
# Groups: year_ [2]
year_ qtr_ foo mean_foo dfoo
<dbl> <dbl> <int> <dbl> <dbl>
1 2019 1 7 6 1
2 2019 2 8 6 2
3 2019 3 3 6 -3
4 2019 4 6 6 0
5 2020 1 2 3 -1
6 2020 2 4 3 1
7 2020 3 5 3 2
8 2020 4 1 3 -2
Option 2 is fine if you don't need the intermediate object anyway, and you don't even need to create mean_foo in your mutate() call:
df %>% group_by(year_) %>% mutate(dfoo=foo-mean(foo))
Also, with data.table:
library(data.table)
setDT(df)[, dfoo := foo - mean(foo), by = year_]
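One difference worth noting between the two dplyr options: Option 2 returns a grouped tibble (see the "# Groups: year_ [2]" line in the df_with_stats2 output above), and that grouping carries over into later summarize/mutate/filter steps. If that is not wanted, finish with ungroup(); a minimal sketch:

library(dplyr)
df %>%
  group_by(year_) %>%
  mutate(mean_foo = mean(foo),
         dfoo = foo - mean_foo) %>%
  ungroup()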
In the MWE code at the bottom, I'm trying to generate a running balance for each unique id, carried from one row to the next. For example, when running the code below, the resulting data2 should be:
id  plusA  plusB  minusC  running_balance   [desired calculation for running balance]
 1      3      5      10               -2   3 + 5 - 10 = -2
 2      4      5       9                0   4 + 5 - 9 = 0
 3      8      5       8                5   8 + 5 - 8 = 5
 3      1      4       7                3   id doesn't change, so 5 from above + (1 + 4 - 7) = 3
 3      2      5       6                4   id doesn't change, so 3 from above + (2 + 5 - 6) = 4
 5      3      6       5                4   3 + 6 - 5 = 4
In the MWE below, when id stays the same from one row to the next, the calculation refers to the prior row's plusA amount rather than the prior row's running_balance. I've tried changing it to some form of lag(running_balance...) without luck so far.
I'm trying to avoid pulling in too many packages. For example, I understand the purrr package offers an accumulate() function, but I'd rather stick to dplyr for now. Is there a simple way to do this using dplyr's mutate()? I also tried fiddling around with cumsum() (base R, but usable inside mutate()), which should work here, but I'm unsure how to string several of them together.
MWE code:
data <- data.frame(id = c(1, 2, 3, 3, 3, 5),
                   plusA = c(3, 4, 8, 1, 2, 3),
                   plusB = c(5, 5, 5, 4, 5, 6),
                   minusC = c(10, 9, 8, 7, 6, 5))
library(dplyr)
data2 <- subset(
  data %>%
    mutate(extra = case_when(id == lag(id) ~ lag(plusA), TRUE ~ 0)) %>%
    mutate(running_balance = plusA + plusB - minusC + extra),
  select = -c(extra)
)
Using dplyr:
data %>%
  mutate(running_balance = plusA + plusB - minusC) %>%
  group_by(id) %>%
  mutate(running_balance = cumsum(running_balance)) %>%
  ungroup()
Output:
# A tibble: 6 x 5
id plusA plusB minusC running_balance
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 3 5 10 -2
2 2 4 5 9 0
3 3 8 5 8 5
4 3 1 4 7 3
5 3 2 5 6 4
6 5 3 6 5 4
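For reference, the same per-id running balance can also be computed in base R with ave(), which applies cumsum() within each id (a sketch, no extra packages needed):

data$running_balance <- ave(data$plusA + data$plusB - data$minusC,
                            data$id, FUN = cumsum)
data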
Coming from Stata, I am sometimes still struggling with R's different programming approach, in particular when it comes to avoiding for loops.
In the example below, I have written two functions which overwrite the original values of ex$status1 and ex$status2. For every id, the original values of the two variables should be replaced with a value x if there is any occurrence of x within the respective id.
The function myfunc2 is perfectly capable of performing this task for several variables (in the example below: status1 and status2).
My problem, however, occurs when trying to impose a sequential order on replacing the initial values. The order is given as c(1,5,3,7). That is, if value 1 is observed for a given id, all the values of the variable for this id should be replaced by 1. Then the procedure should be repeated on the updated data for the remaining values of c(1,5,3,7). I accomplished this task with a for loop, but failed to do it using one of purrr's map functions, because these functions were always executed on the original tibble and did not update it sequentially (see code below). Can anyone show me how to obtain the desired result with a map function (or simply without using the for loop)?
ex <- tibble(id = c(1, 1, 1, 1, 2, 2, 2),
             status1 = c(3, 3, 5, 7, 1, 5, 7),
             status2 = c(3, 3, 3, 7, 7, 5, 7))
ex
myfunc <- function(df, id, var, val) {
  # For each id, replace every value of var with val if val occurs at least once
  df <- df %>%
    group_by(id) %>%
    mutate({{var}} := case_when(any({{var}} == val) ~ val,
                                TRUE ~ {{var}})) %>%
    ungroup() %>%
    select({{var}})
  return(df)
}
myfunc(ex, id, status1, 1)
myfunc2 <- function(df, id, var, val) {
  # Apply myfunc to each variable named in var and bind the resulting columns
  map_dfc(var,
          ~ myfunc(df, id, !!sym(.x), val)) %>%
    add_column(id = df$id, .before = 1)
}
myfunc2(ex, id, c("status1", "status2"), 1)
# this works
for (i in c(1, 5, 3, 7)) {
  ex <- myfunc2(ex, id, c("status1", "status2"), i)
}
# this does not work
c(1, 5, 3, 7) %>%
  map_dfc(function(x) {ex <- myfunc2(ex, id, c("status1", "status2"), x)})
# original data
# A tibble: 7 x 3
id status1 status2
<dbl> <dbl> <dbl>
1 1 3 3
2 1 3 3
3 1 5 3
4 1 7 7
5 2 1 7
6 2 5 5
7 2 7 7
# Data after executing the for-loop
# A tibble: 7 x 3
id status1 status2
<dbl> <dbl> <dbl>
1 1 5 3
2 1 5 3
3 1 5 3
4 1 5 3
5 2 1 5
6 2 1 5
7 2 1 5
lapply/map loop over each of the input elements and return the output, but they won't update the original object recursively the way a for loop does. If we want to do that, we have to do a scoping update with <<-, which may not be the best option. I would recommend the for loop.
library(dplyr)
library(purrr)
c(1, 5, 3, 7) %>%
  map_dfc(function(x) {
    ex <<- myfunc2(ex, id, c("status1", "status2"), x)
  })
Now, we check the object 'ex'
ex
# A tibble: 7 x 3
# id status1 status2
# <dbl> <dbl> <dbl>
#1 1 5 3
#2 1 5 3
#3 1 5 3
#4 1 5 3
#5 2 1 5
#6 2 1 5
#7 2 1 5
With tidyverse, we could use reduce to do this instead of map and <<-
reduce(list(1, 5, 3, 7),
       ~ myfunc2(.x, id, c("status1", "status2"), .y), .init = ex)
# A tibble: 7 x 3
# id status1 status2
# <dbl> <dbl> <dbl>
#1 1 5 3
#2 1 5 3
#3 1 5 3
#4 1 5 3
#5 2 1 5
#6 2 1 5
#7 2 1 5
which is similar to the base R Reduce
Reduce(function(x, y) myfunc2(x, id, c("status1", "status2"), y),
       list(1, 5, 3, 7), init = ex)
# A tibble: 7 x 3
# id status1 status2
# <dbl> <dbl> <dbl>
#1 1 5 3
#2 1 5 3
#3 1 5 3
#4 1 5 3
#5 2 1 5
#6 2 1 5
#7 2 1 5
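To make the call order explicit, here is a toy sketch (the function f below is purely illustrative, not part of the question): reduce()/Reduce() feed the accumulated result back in as the first argument at every step, which is exactly what the for loop does with ex.

library(purrr)
f <- function(acc, val) paste0(acc, "-", val)  # stand-in for myfunc2
reduce(list(1, 5, 3, 7), f, .init = "ex")
# [1] "ex-1-5-3-7"   i.e. f(f(f(f("ex", 1), 5), 3), 7)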
One advantage of these approaches is that they avoid the side effect, i.e. we don't have to modify the original object:
ex
# A tibble: 7 x 3
# id status1 status2
# <dbl> <dbl> <dbl>
#1 1 3 3
#2 1 3 3
#3 1 5 3
#4 1 7 7
#5 2 1 7
#6 2 5 5
#7 2 7 7
However, considering the simplicity of the for loop (both to understand and to run), it may still be the better choice (subjective opinion).
I am trying to expand an existing dataset, which currently looks like this:
library(tibble)

df <- tibble(
  site = letters[1:3],
  years = rep(4, 3),
  tr = c(3, 6, 4)
)
tr is the total number of replicates for each site/year combination. I simply want to add in the replicates and later the response variable for each replicate. This was easy for a single site/year combination using the following function:
f <- function(site = NULL, years = NULL, t = NULL) {
  df <- tibble(
    site = rep(site, each = t, times = years),
    tr = rep(1:t, times = years),
    year = rep(1:years, each = t)
  )
  df
}
# For one site:
f(site='a', years=4, t=3)
# Producing this:
# # A tibble: 12 x 3
# site tr year
# <chr> <int> <int>
# 1 a 1 1
# 2 a 2 1
# 3 a 3 1
# 4 a 1 2
# 5 a 2 2
# 6 a 3 2
# 7 a 1 3
# 8 a 2 3
# 9 a 3 3
# 10 a 1 4
# 11 a 2 4
# 12 a 3 4
How can the function be applied to each row of the input dataframe to produce the final dataframe? One of the apply functions in base R, or pmap_df() from the purrr package, would seem ideal, but I'm unfamiliar with how these functions work and all my efforts have only produced errors.
If we want to apply the same function f to each row, use pmap:
library(purrr)
pmap_dfr(df, ~ f(..1, ..2, ..3))
# A tibble: 52 x 3
# site tr year
# * <chr> <int> <int>
# 1 a 1 1
# 2 a 2 1
# 3 a 3 1
# 4 a 1 2
# 5 a 2 2
# 6 a 3 2
# 7 a 1 3
# 8 a 2 3
# 9 a 3 3
#10 a 1 4
# … with 42 more rows
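A side note (my assumption, not part of the answer above): the positional ..1, ..2, ..3 are needed because the function's third argument is named t while the column is named tr. If the argument names matched the column names, pmap would match by name and the anonymous function could be dropped. A sketch with a renamed copy f2:

library(purrr)
library(tibble)
f2 <- function(site = NULL, years = NULL, tr = NULL) {
  tibble(site = rep(site, each = tr, times = years),
         tr = rep(1:tr, times = years),
         year = rep(1:years, each = tr))
}
pmap_dfr(df, f2)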
Another option is condense() from the development version of dplyr:
library(tidyr)
df %>%
  group_by(rn = row_number()) %>%
  condense(out = f(site, years, tr)) %>%
  unnest(c(out))
Or in base R, we can also use do.call with Map
do.call(rbind, do.call(Map, c(f, unname(as.data.frame(df)))))
Well, in base R you could do:
do.call(rbind, do.call(Vectorize(f, SIMPLIFY = FALSE), unname(df)))
# A tibble: 52 x 3
site tr year
* <chr> <int> <int>
1 a 1 1
2 a 2 1
3 a 3 1
4 a 1 2
5 a 2 2
6 a 3 2
7 a 1 3
8 a 2 3
9 a 3 3
10 a 1 4
# ... with 42 more rows
do.call(rbind, lapply(split(df, df$site), function(x) {
  with(x, data.frame(site,
                     years = rep(sequence(years), each = tr),
                     tr = rep(sequence(tr), years)))
}))
We can use Map to apply f to every value of site, years and tr.
do.call(rbind, Map(f, df$site, df$years, df$tr))
# A tibble: 52 x 3
# site tr year
# * <chr> <int> <int>
# 1 a 1 1
# 2 a 2 1
# 3 a 3 1
# 4 a 1 2
# 5 a 2 2
# 6 a 3 2
# 7 a 1 3
# 8 a 2 3
# 9 a 3 3
#10 a 1 4
# … with 42 more rows
Akrun's answer worked well for me, so I modified it to make the function that is applied to each row of the data frame a little more explicit:
df1 <- pmap_df(df, function(site, years, tr) {
  site <- rep(site, each = tr, times = years)
  year <- rep(1:years, each = tr)
  tr <- rep(1:tr, times = years)
  return(tibble(site, year, tr))
})
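Another possible variant (a sketch of mine, not from the answers above): build each site's year/replicate grid with tidyr::expand_grid() in a list column and unnest it, which avoids the rep() bookkeeping:

library(dplyr)
library(tidyr)

df %>%
  rowwise() %>%
  mutate(grid = list(expand_grid(year = seq_len(years), tr = seq_len(tr)))) %>%
  ungroup() %>%
  select(site, grid) %>%
  unnest(grid)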
I have a dataframe that looks like this:
df <- data.frame(ID = c(1,2,3,4,5,6), Type = c("A","A","B","B","C","C"), `2019` = c(1,2,3,4,5,6),`2020` = c(2,3,4,5,6,7), `2021` = c(3,4,5,6,7,8))
ID Type X2019 X2020 X2021
1 1 A 1 2 3
2 2 A 2 3 4
3 3 B 3 4 5
4 4 B 4 5 6
5 5 C 5 6 7
6 6 C 6 7 8
Now, I'm looking for some code that does the following:
1. Creates a new data.frame for every row in df
2. Names the new data.frame with a combination of "ID" and "Type" (A_1, A_2, ..., C_6)
The resulting new dataframes should look like this (example for A_1, A_2 and C_6):
Year Values
1 2019 1
2 2020 2
3 2021 3
Year Values
1 2019 2
2 2020 3
3 2021 4
Year Values
1 2019 6
2 2020 7
3 2021 8
There are a few things that complicate the code:
1. The code should work in the next few years without any changes, meaning next year the data.frame df will no longer contain the years 2019-2021, but rather 2020-2022.
2. As the data.frame df is only a minimal reproducible example, I need some kind of loop. In the "real" data, I have a lot more rows and therefore a lot more dataframes to be created.
Unfortunately, I can't give you any code, as I have absolutely no idea how I could manage that.
While researching, I found the following code that may help address the first problem with the changing years:
year <- as.numeric(format(Sys.Date(), "%Y"))
Further, I read about lists, and that it may help to work with a list in a for loop and then transform the list back into a data frame. Sorry for my limited approach; I hope someone can give me a hint or even the solution to my problem. If you need any further information, please let me know. Thanks in advance!
A kind of similar question to mine:
Populating a data frame in R in a loop
Try this:
library(stringr)
library(dplyr)
library(tidyr)
library(magrittr)
df %>%
  gather(Year, Values, 3:5) %>%
  mutate(Year = str_sub(Year, 2)) %>%
  select(ID, Year, Values) %>%
  group_split(ID) # split(.$ID)
# [[1]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 1 2019 1
# 2 1 2020 2
# 3 1 2021 3
#
# [[2]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 2 2019 2
# 2 2 2020 3
# 3 2 2021 4
#
# [[3]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 3 2019 3
# 2 3 2020 4
# 3 3 2021 5
#
# [[4]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 4 2019 4
# 2 4 2020 5
# 3 4 2021 6
#
# [[5]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 5 2019 5
# 2 5 2020 6
# 3 5 2021 7
#
# [[6]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 6 2019 6
# 2 6 2020 7
# 3 6 2021 8
Data
df <- data.frame(ID = c(1,2,3,4,5,6), Type = c("A","A","B","B","C","C"), `2019` = c(1,2,3,4,5,6),`2020` = c(2,3,4,5,6,7), `2021` = c(3,4,5,6,7,8))
library(magrittr)
library(tidyr)
library(dplyr)
library(stringr)
names(df) <- str_replace_all(names(df), "X", "") #remove X's from year names
df %>%
  gather(Year, Values, 3:5) %>%
  select(ID, Year, Values) %>%
  group_split(ID)
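Both answers return an unnamed list. If the Type_ID names from the question (A_1, ..., C_6) are wanted, and the year columns should be selected by pattern rather than by position 3:5 (so the code keeps working in later years), one possible sketch (assuming one row per ID, as in the example data):

library(dplyr)
library(tidyr)
library(stringr)

out <- df %>%
  pivot_longer(matches("^X?\\d{4}$"), names_to = "Year", values_to = "Values") %>%
  mutate(Year = str_remove(Year, "^X")) %>%
  select(ID, Year, Values) %>%
  group_split(ID) %>%
  setNames(paste(df$Type, df$ID, sep = "_"))

out$A_1
# list2env(out, envir = .GlobalEnv) would create A_1 ... C_6 as separate objects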
I have a data frame:
df <- data.frame(id = c(1,2,3,4,5,6,7,8,9,10,11),
                 score = c(1,3,5,7,3,4,7,1,2,6,3),
                 cluster = c(1,1,2,2,2,2,3,3,3,3,3))
I also have a set of cluster IDs and the number of samples I'd like from each cluster,
sample_sizes<-data.frame(cluster=c(1,2,3),samples=c(1,3,2))
I would like to have a samples dataframe consisting of samples selected according to the number of samples specified in the sample_sizes dataframe.
For instance, the following table would be a potential result:
id score cluster
2 3 1
3 5 2
5 3 2
6 4 2
9 2 3
11 3 3
I have looked at the following using dplyr:
df2 <- merge(df, sample_sizes)
df3 <- df2 %>%
  group_by(cluster) %>%
  sample_n(samples)
but receive an error.
Is there a best method for doing this? A solution that could scale with larger numbers of clusters and samples would be ideal.
Thank you in advance!
We may use map2_df along with split:
library(dplyr)
library(purrr)

map2_df(split(df, df$cluster), sample_sizes$samples, sample_n)
# id score cluster
# 1 1 1 1
# 2 4 7 2
# 3 5 3 2
# 4 3 5 2
# 5 7 7 3
# 6 9 2 3
split(df, df$cluster) gives a list of data frames, one for each cluster; map2_df then applies sample_n to each of them with the corresponding sample size, just as you intended, and binds the resulting data frames into one.
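One caveat (my addition, not from the original answer): this relies on sample_sizes being ordered the same way as the split() result, which is ordered by the sorted unique cluster values. A slightly more defensive sketch:

sizes <- sample_sizes$samples[order(sample_sizes$cluster)]
map2_df(split(df, df$cluster), sizes, sample_n)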
Here is a way using tidyr::nest() and purrr::map2
library(tidyverse)
df %>%
  group_by(cluster) %>%
  nest() %>%
  left_join(sample_sizes) %>%
  mutate(samp = map2(data, samples, sample_n)) %>%
  select(cluster, samples, samp) %>%
  unnest()
Joining, by = "cluster"
# A tibble: 6 x 4
cluster samples id score
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 1
2 2 3 5 3
3 2 3 6 4
4 2 3 4 7
5 3 2 8 1
6 3 2 10 6
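One more possibility (a sketch of mine, using dplyr only): group_modify() passes each cluster's rows as .x and the cluster key as .y, so the matching sample size can be looked up directly:

library(dplyr)
df %>%
  group_by(cluster) %>%
  group_modify(~ sample_n(.x, sample_sizes$samples[sample_sizes$cluster == .y$cluster])) %>%
  ungroup()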